Model Quantization for Production-Level Neural Network Inference

Patric Zhao
Apr 16

Authors: Patric Zhao, Xinyu Chen, Zhennan Qin, Jason Ye

Update (July 1, 2019): refreshed the performance data on an Intel® DL Boost (VNNI) enabled instance, AWS EC2 C5.24xlarge.

Introduction

In deep learning, inference is the stage where a pretrained neural network model is deployed to perform image classification, object detection, and other prediction tasks. In the real world, and especially in enterprises, inference matters because it is the stage of the analytics pipeline where valuable results are delivered to end users based on their production-level data. A huge number of inference requests from end users is constantly being routed to cloud servers all over the world.

A major measure of inference performance is latency, or how long it takes to complete a prediction; shorter latency ensures a good user experience. Single-batch inference (batch size 1) is very common in production, and this kind of workload is CPU friendly.

When deploying deep learning infrastructure in real production environments, high performance and cost-efficient services are key. Therefore, many cloud service providers (CSPs) and hardware vendors have optimized their services and architectures for inference, such as Amazon SageMaker, Deep Learning AMIs from Amazon Web Services (AWS) and Intel® Deep Learning Boost (Intel® DL Boost), including Vector Neural Network Instructions (VNNI) found in 2nd Generation Intel® Xeon® Scalable processors.

The Apache MXNet* community delivered quantization approaches to improve performance and reduce deployment costs for inference. Lower precision (INT8) has two main benefits. First, computation can be accelerated by lower-precision instructions, such as VNNI. Second, lower-precision data types save memory bandwidth and allow for better cache locality and power savings.

The new quantization approach can deliver up to a 4X speedup for the compute-bound parts of INT8 inference with Intel® DL Boost on 2nd Gen Intel Xeon Scalable processors. Take ResNet50 v1 as an example: on AWS EC2 C5.24xlarge it achieves a total 6.42X performance gain with only a 0.38% accuracy drop, of which 1.75X comes from operator fusion and 3.66X from VNNI acceleration.
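As a quick sanity check on the ResNet50 v1 numbers above, the two reported factors should roughly multiply to the overall gain. The sketch below assumes that the fusion and quantization speedups compose multiplicatively, which the text does not state explicitly, and it works with the rounded figures quoted above.

```python
# Reported ResNet50 v1 speedups from the text (rounded values).
fusion_speedup = 1.75   # from operator fusion
vnni_speedup = 3.66     # from INT8 quantization with VNNI
reported_total = 6.42   # end-to-end gain quoted in the text

# Assumption: the two gains compose multiplicatively.
combined = fusion_speedup * vnni_speedup
relative_gap = abs(combined - reported_total) / reported_total

print(f"combined = {combined:.3f}X, reported = {reported_total}X, "
      f"gap = {relative_gap:.2%}")
```

The product lands within one percent of the reported total, which is consistent with rounding in the published factors.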

Model Quantization

Apache MXNet supports model quantization from float32 to signed INT8 (s8) or unsigned INT8 (u8). s8 is designed for general inference, while u8 is specific to CNNs. Most CNNs use ReLU as the activation function, so output activations are non-negative. In that case u8 needs no sign bit and can use one more bit for the data, which gives better accuracy.
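To illustrate why the extra bit helps on non-negative data, here is a minimal pure-Python sketch. It is not MXNet's internal code: the helper names, the 127/255 quantized ranges, and the sample activations are assumptions chosen for illustration.

```python
def quantize_s8(values, threshold):
    """Symmetric signed quantization: map [-threshold, threshold] to [-127, 127]."""
    scale = 127.0 / threshold
    return [max(-127, min(127, round(v * scale))) for v in values], scale


def quantize_u8(values, threshold):
    """Unsigned quantization for non-negative data: map [0, threshold] to [0, 255]."""
    scale = 255.0 / threshold
    return [max(0, min(255, round(v * scale))) for v in values], scale


def max_abs_error(values, quantized, scale):
    """Largest difference between a value and its dequantized counterpart."""
    return max(abs(v - q / scale) for v, q in zip(values, quantized))


# Hypothetical post-ReLU activations (all non-negative).
activations = [i * 0.0371 for i in range(100)]
threshold = max(activations)

q_s8, scale_s8 = quantize_s8(activations, threshold)
q_u8, scale_u8 = quantize_u8(activations, threshold)
err_s8 = max_abs_error(activations, q_s8, scale_s8)
err_u8 = max_abs_error(activations, q_u8, scale_u8)

print(f"max error s8 = {err_s8:.5f}, max error u8 = {err_u8:.5f}")
```

With s8, half of the quantized range is reserved for negative values that never occur after ReLU, so the step size is about twice as coarse as with u8.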

The INT8 inference pipeline has two stages. Both start from a trained FP32 model, which consists of a saved model definition (JSON file) and a parameter file.

Fig 1. MXNet int8 inference pipeline
  • Quantization with calibration (offline stage). During this stage, a small fraction of images from the validation dataset (1–5%) is used to collect statistics for each layer, either naive min/max values or optimal thresholds based on entropy theory. These statistics, together with the execution profile of each layer, define the scaling factors for symmetric quantization. The output of this stage is a calibrated model with quantized operators, saved as a JSON file and a parameter file.
  • INT8 inference (run-time stage). The quantized and calibrated model is a pair of a JSON file and a param file, which can be loaded and used for inference just like the original model, except that it runs faster with only a small difference in accuracy.
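The offline stage can be sketched in a few lines of plain Python. This is a simplified illustration of naive min/max calibration followed by symmetric quantization, not the MXNet implementation; the function names and sample values are hypothetical, and the entropy-based mode is not shown.

```python
def calibrate_naive(batches):
    """Track the running min/max of a layer's output over calibration batches."""
    lo, hi = float("inf"), float("-inf")
    for batch in batches:
        lo = min(lo, min(batch))
        hi = max(hi, max(batch))
    return lo, hi


def symmetric_scale(lo, hi, qmax=127):
    """Symmetric quantization: a single scale derived from the largest magnitude."""
    threshold = max(abs(lo), abs(hi))
    return qmax / threshold


def quantize(values, scale, qmax=127):
    """Scale, round, and clip values into the signed INT8 range."""
    return [max(-qmax, min(qmax, round(v * scale))) for v in values]


# Hypothetical layer outputs recorded on three small calibration batches.
calib_batches = [[-0.8, 0.1, 0.5], [0.3, -1.6, 0.9], [2.0, -0.2, 0.0]]
lo, hi = calibrate_naive(calib_batches)
scale = symmetric_scale(lo, hi)

# 3.5 lies outside the calibrated range and is clipped to 127.
sample = [0.5, -1.6, 2.0, 3.5]
print(lo, hi, quantize(sample, scale))  # -1.6 2.0 [32, -102, 127, 127]
```

The clipped value shows why calibration data should be representative: anything beyond the observed range saturates at run time.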

Acceleration

Apache MXNet provides many advanced features to accelerate quantized inference, including a quantized data loader, offline calibration, and graph optimization. Apache MXNet is one of the first deep learning frameworks to deliver a fully quantized INT8 network, from data loading to compute-intensive operations, with production-level quality. In the quantized network, common computation patterns such as convolution + relu are fused by a graph optimizer, so the whole quantized network is more compact and efficient than the original one. As an example, the ResNet 50 v1 figure below shows how the network changes after fusion and quantization.

Fig 2. ResNet50 V1 Architecture (Left: FP32 Right: INT8)
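The idea behind fusing a pattern such as convolution + relu can be shown with a toy example. The sketch below treats a network as a flat list of operator names, which is a strong simplification: the real graph optimizer works on MXNet's symbolic graph, and the names used here are hypothetical.

```python
def fuse_conv_relu(ops):
    """Replace each adjacent ("conv", "relu") pair with a single fused operator."""
    fused = []
    i = 0
    while i < len(ops):
        if ops[i] == "conv" and i + 1 < len(ops) and ops[i + 1] == "relu":
            fused.append("conv_relu")
            i += 2  # skip both operators of the matched pattern
        else:
            fused.append(ops[i])
            i += 1
    return fused


# Hypothetical linear slice of a CNN, written as a list of operator names.
graph = ["conv", "relu", "pool", "conv", "relu", "conv", "fc"]
print(fuse_conv_relu(graph))
# ['conv_relu', 'pool', 'conv_relu', 'conv', 'fc']
```

Each fused pair removes one intermediate tensor that would otherwise be written to and read back from memory, which is where part of the fusion speedup comes from.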

All of these features are transparent to the user when they deploy models on different hardware. In other words, end users don’t need to alter their production code and can get a performance improvement when they switch to a new AWS EC2 instance, such as Intel® DL Boost-enabled instances.

Fig 3. Intel® Deep Learning Boost

Deploy Your Models

Calibration tools and APIs are available so that customers can easily quantize their float32 models to INT8. Apache MXNet also officially provides two kinds of quantization examples: image classification and object detection (SSD-VGG16). Users can also refer to the quantization APIs to integrate quantization into their real-world workloads.

Below, SSD-VGG16 is used as an example to show the implementation and results of MXNet model quantization.

Prepare

Use the following command to install the latest build of MXNet with Intel® Math Kernel Library for Deep Neural Networks (Intel® MKL-DNN) support. The --pre flag lets pip pick up pre-release versions.

pip install --pre mxnet-mkl

Follow the Training instructions to train an FP32 SSD-VGG16_reduced_300x300 model based on the Pascal VOC dataset. You can also download our SSD-VGG16 pre-trained model and packed binary data. Create model and data directories if they do not exist, extract the zip files, then rename the uncompressed files as follows.

data/
    |--val.rec
    |--val.lxt
    |--val.idx
model/
    |--ssd_vgg16_reduced_300-0000.params
    |--ssd_vgg16_reduced_300-symbol.json

Then use the command below to verify the float32 pretrained model:

# USE MKLDNN AS SUBGRAPH BACKEND
export MXNET_SUBGRAPH_BACKEND=MKLDNN
python evaluate.py --cpu --num-batch 10 --batch-size 224 --deploy --prefix=./model/ssd_

Calibration

MXNet provides a calibration script for SSD-VGG16. Users can set different configurations when quantizing float32 SSD-VGG16 models to INT8, including batch size, number of batches for calibration, calibration mode, quantization destination data type for input data, excluded layers, and other configurations for data loaders. Use the following command for quantization. By default, this script uses five batches (32 samples per batch) for naive calibration.

python quantization.py

After quantization, the INT8 models are saved in the model directory as follows.

data/
    |--val.rec
    |--val.lxt
    |--val.idx
model/
    |--ssd_vgg16_reduced_300-0000.params
    |--ssd_vgg16_reduced_300-symbol.json
    |--cqssd_vgg16_reduced_300-0000.params
    |--cqssd_vgg16_reduced_300-symbol.json

Deploy INT8 Inference

Use the following command to run inference with the INT8 model.

python evaluate.py --cpu --num-batch 10 --batch-size 224 --deploy --prefix=./model/cqssd_

Detection Visualization

Pick one image from the Pascal VOC2007 validation dataset and the detection results should show as follows. The first image shows the detection result from float32 inference and the second shows the detection result from INT8 inference.

Use the following command to visualize detection.

# Download demo image
python data/demo/download_demo_images.py
# visualize float32 detection
python demo.py --cpu --network vgg16_reduced --data-shape 300 --deploy --thresh 0.4 --prefix=./model/ssd_
# visualize int8 detection
python demo.py --cpu --network vgg16_reduced --data-shape 300 --deploy --thresh 0.4 --prefix=./model/cqssd_
Fig 4.1. SSD-VGG Detection, FP32
Fig 4.2. SSD-VGG Detection, INT8

Performance

In this section, we show the end-to-end performance boost for inference with Intel DL Boost. You can find more quantized models and their performance in the Apache MXNet C++ interface and GluonCV.

The below CPU performance is from an AWS EC2 C5.24xlarge instance with custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake). See complete configuration details in notices and disclaimers.

Total throughput improves significantly through operator fusion and model quantization, with speedups ranging from 4.06X to 6.42X in Figure 5. The benefit of operator fusion varies with how many common patterns can be fused in the model.

Fig 5. Speedup from operator fusion and quantization

The model quantization delivers more stable speedup over all models, such as 3.66X for ResNet 50 v1, 3.82X for ResNet 101 v1 and 3.77X for SSD-VGG16, which is very close to the theoretical 4X speedup from INT8.

Fig 6. Speedup from the quantization by Intel VNNI instruction
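To make "very close to the theoretical 4X" concrete, the short calculation below expresses each reported quantization speedup as a fraction of that upper bound. It uses only the figures quoted in the text; the variable names are illustrative.

```python
THEORETICAL_SPEEDUP = 4.0  # upper bound for INT8 quoted in the text

# Quantization-only speedups reported in the text.
quantization_speedup = {
    "ResNet 50 v1": 3.66,
    "ResNet 101 v1": 3.82,
    "SSD-VGG16": 3.77,
}

efficiency = {
    model: speedup / THEORETICAL_SPEEDUP
    for model, speedup in quantization_speedup.items()
}

for model, fraction in efficiency.items():
    print(f"{model}: {fraction:.3f} of the theoretical 4X")
```

All three models reach more than 90% of the theoretical bound.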

For latency, measured at batch size 1, a lower runtime is better. Most of the models in Figure 7 complete within 7 ms; the exception is SSD-VGG16. The edge-level model MobileNet1 does especially well, with a latency of 1.01 ms. In a real production environment it is not necessary to use all the cores of a CPU for batch size 1 inference. We recommend finding the right tradeoff between the number of cores and the acceptable latency, for example by using only 4 cores.

Fig 7. MXNet* Fusion and Quantization Latency
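One way to explore the tradeoff between core count and latency is a small timing harness like the sketch below. It assumes the backend is OpenMP-based and honors the OMP_NUM_THREADS environment variable, which must be set before the framework is imported. The predict function is a hypothetical stand-in for a real model's forward pass, not an MXNet call.

```python
import os
import statistics
import time

# Assumption: the backend honors OMP_NUM_THREADS; set it before importing
# the framework. setdefault keeps any value already set in the environment.
os.environ.setdefault("OMP_NUM_THREADS", "4")


def measure_latency_ms(predict, sample, warmup=10, iters=100):
    """Return (median, p99) latency in milliseconds for single-sample calls."""
    for _ in range(warmup):
        predict(sample)  # warm-up runs are not timed
    timings = []
    for _ in range(iters):
        start = time.perf_counter()
        predict(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    timings.sort()
    p99 = timings[min(len(timings) - 1, int(0.99 * len(timings)))]
    return statistics.median(timings), p99


def dummy_predict(x):
    # Hypothetical stand-in for a real model's forward pass.
    return sum(v * v for v in x)


median_ms, p99_ms = measure_latency_ms(dummy_predict, [0.5] * 1000)
print(f"median = {median_ms:.3f} ms, p99 = {p99_ms:.3f} ms")
```

When timing a real MXNet model, the forward call returns before the computation finishes, so the timed function should wait for the result (for example by copying the output to NumPy) before the clock is stopped. Repeating the measurement with different thread counts shows where adding cores stops reducing latency.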

In addition to the large speedup, the accuracy of the Apache MXNet quantization solution is very close to that of the FP32 models, without the need to retrain the model. As Figure 8 shows, the reduction in accuracy is small, less than 0.5%.

Fig 8. MXNet* Fusion and Quantization Accuracy

Takeaways

  • Apache MXNet accelerates inference performance using model quantization powered by the Intel MKL-DNN library and Intel Xeon Scalable CPUs.
  • INT8 inference shows great performance improvements for CNN networks, from image classification to object detection.
  • The accuracy of quantized INT8 models is very close to that of FP32 models, with a drop normally within 0.5%.
  • Advanced optimizations, such as offline calibration and graph optimization, provide extra performance speedup.
  • 2nd Gen Intel Xeon Scalable processors further boost model performance through Intel DL Boost and its new VNNI instruction set, with no changes required from the user.

Acknowledgments

Thanks for the great support from the Apache community and the Amazon MXNet team.

Lots of help from Mu Li, Jun Wu, Da Zheng, Ziheng Jiang, Sheng Zha, Anirudh Subramanian, Kim Sukwon, Haibin Lin, Yixin Bao, Emily Hutson and Emily Backus.

Also thanks to the customers of Apache MXNet for providing great feedback.

Appendix

Table 1. Raw data on AWS EC2 C5.24xlarge with 1 instance under 2 sockets

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of 1st July 2019 by AWS and may not reflect all publicly available security updates. No product or component can be absolutely secure.

Test Configuration:

Reproduce Script: https://github.com/intel/optimized-models/tree/v1.0.6/mxnet/blog/medium_vnni

Software: Apache MXNet 1.5.0b20190623 and benchmark script commit id f44f6cfbe752fd8b8036307cecf6a30a30ad8557

Hardware: AWS EC2 c5.24xlarge Custom 2nd generation Intel Xeon Scalable Processors (Cascade Lake) with a sustained all core Turbo frequency of 3.6GHz and single core turbo frequency of up to 3.9GHz

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation

Apache MXNet

Apache MXNet (incubating) is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity.

Thanks to Vishaal Kapoor.
