PyTorch Inference Acceleration with Intel® Neural Compressor

Feng Tian · Published in PyTorch · Jun 28, 2022

Authors: Feng Tian, Haihao Shen, Huma Abidi, Chandan Damannagari

Overview

Intel® Neural Compressor is an open-source Python library for model compression that reduces the model size and increases the speed of deep learning inference for deployment on CPUs or GPUs. It provides unified interfaces across multiple deep learning frameworks for popular network compression technologies, such as quantization, pruning, and knowledge distillation. This tool supports automatic accuracy-driven tuning strategies to help the user quickly find the best quantized model. It also implements different weight pruning algorithms to generate pruned models that meet a predefined sparsity goal, and supports knowledge distillation to transfer knowledge from a teacher model to a student model. Intel® Neural Compressor provides APIs for a range of deep learning frameworks including TensorFlow, PyTorch, and MXNet, in addition to ONNX Runtime for greater interoperability across frameworks. This blog is focused on the benefits of using the tool with a PyTorch model.

Quantization

Intel® Neural Compressor extends PyTorch quantization by providing advanced recipes for quantization, automatic mixed precision, and accuracy-aware tuning. It takes a PyTorch model as input and yields an optimized model.

Fine-grained Quantization

Intel® Neural Compressor’s quantization capability is built upon the standard PyTorch quantization API and makes its own modifications to support fine-grained quantization granularity from the model level to the operator level. This approach gives users the ability to get better accuracy without additional hand-tuned work.
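As a concrete illustration, the sketch below shows an operator-level override using the neural_compressor Python API. The `PostTrainingQuantConfig` and `op_name_dict` names, the layer name, and the exact dictionary schema are assumptions based on recent releases and may differ from the version described in this post.

```python
# Sketch: operator-level quantization override (assumes the neural_compressor
# 2.x Python API; the exact op_name_dict schema may differ between releases).
from neural_compressor import PostTrainingQuantConfig

conf = PostTrainingQuantConfig(
    approach="static",            # post-training static quantization
    op_name_dict={
        # hypothetical layer name: keep an accuracy-sensitive operator in FP32
        "bert.encoder.layer.11.output.dense": {
            "activation": {"dtype": ["fp32"]},
            "weight": {"dtype": ["fp32"]},
        },
    },
)
```

Model-level settings apply to every operator by default, while entries like the one above override individual operators, which is what allows accuracy to be recovered without hand-tuned quantization code.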

Advanced Automatic Mixed-Precision

Intel® Neural Compressor further extends the scope of the PyTorch Automatic Mixed Precision (AMP) feature on 3rd Gen Intel® Xeon® Scalable Processors. Compared with the vanilla PyTorch AMP implementation, the tool supports INT8 in addition to BF16 and FP32. It first converts all the quantizable operators from FP32 to INT8, and then converts the remaining FP32 operators to BF16 operators if BF16 kernels are supported in PyTorch and accelerated by the underlying hardware.
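For comparison, vanilla PyTorch AMP on CPU only toggles between BF16 and FP32, roughly as in the minimal sketch below; Intel® Neural Compressor's advanced AMP layers INT8 on top of this and falls back to BF16/FP32 only for operators that cannot be quantized.

```python
# Vanilla PyTorch AMP on CPU: BF16/FP32 only (no INT8).
import torch

model = torch.nn.Linear(1024, 1024).eval()
x = torch.randn(8, 1024)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    y = model(x)  # eligible ops run in BF16, the rest stay in FP32
```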

Accuracy-aware Tuning

Intel® Neural Compressor also supports an automatic accuracy-aware tuning mechanism for better quantization productivity. As the first step, the tool queries the framework for its quantization capabilities, such as quantization granularity (per_tensor or per_channel), quantization scheme (symmetric or asymmetric), quantization data type (u8 or s8), and calibration approach (min-max or KL divergence). It then queries the supported data types for each operator. With these queried capabilities, the tool generates a tuning space of quantization configurations and starts the tuning iterations. For each configuration, it performs calibration, quantization, and evaluation. Once the evaluation meets the accuracy goal, the tool terminates the tuning process and produces a quantized model.
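Conceptually, the tuning loop looks like the simplified sketch below. It is an illustration only, not the library's actual implementation; `generate_tuning_space`, `calibrate_and_quantize`, and `evaluate` are hypothetical helpers standing in for the framework-specific steps described above.

```python
# Simplified illustration of accuracy-aware tuning (not the library's real code).
def accuracy_aware_tuning(fp32_model, calib_data, eval_fn, baseline_acc,
                          relative_loss=0.01, max_trials=100):
    # hypothetical helper: enumerate granularity/scheme/dtype/calibration combos
    tuning_space = generate_tuning_space(fp32_model)

    for trial, qconfig in enumerate(tuning_space):
        if trial >= max_trials:
            break
        q_model = calibrate_and_quantize(fp32_model, qconfig, calib_data)  # hypothetical
        acc = evaluate(q_model, eval_fn)                                   # hypothetical
        # stop as soon as the accuracy goal is met
        if acc >= baseline_acc * (1 - relative_loss):
            return q_model
    return None  # no configuration met the accuracy goal
```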

Pruning

The pruning part of Intel® Neural Compressor is mainly focused on unstructured and structured weight pruning, and filter pruning. Unstructured pruning uses a magnitude pruning algorithm, which prunes weights during training when their magnitude is below a predefined threshold. Structured pruning implements experimental tile-wise sparsity kernels to boost the performance of the sparse model. Filter pruning implements a gradient-sensitivity pruning algorithm, which prunes the heads, intermediate layers, and hidden states in an NLP model according to an importance score calculated from the gradient.
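As an illustration of the magnitude criterion described above, the following plain-PyTorch sketch zeroes weights whose absolute value falls below a threshold. It does not use Intel® Neural Compressor's pruning API, which additionally schedules sparsity over training steps.

```python
# Illustration of magnitude pruning: zero out weights below a threshold.
import torch

def magnitude_prune_(module: torch.nn.Module, threshold: float = 1e-2) -> None:
    with torch.no_grad():
        for name, param in module.named_parameters():
            if name.endswith("weight"):
                mask = param.abs() >= threshold
                param.mul_(mask.to(param.dtype))  # in training this runs each step

model = torch.nn.Linear(256, 256)
magnitude_prune_(model, threshold=0.05)
sparsity = (model.weight == 0).float().mean().item()
print(f"achieved sparsity: {sparsity:.2%}")
```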

Distillation

Intel® Neural Compressor also implements a knowledge distillation algorithm to transfer knowledge from a large “teacher” model to a smaller “student” model without loss of validity. In this setup, the same input is fed to both models, and the student model learns by comparing its results to both the teacher's outputs and the ground-truth labels.
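A common way to express this is a loss that blends cross-entropy against the ground-truth labels with a KL-divergence term between temperature-softened teacher and student logits. The sketch below is a generic PyTorch formulation of that idea, not Intel® Neural Compressor's distillation API.

```python
# Generic knowledge-distillation loss: ground-truth CE + softened teacher/student KL.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 2.0, alpha: float = 0.5):
    # hard-label term: student vs. ground truth
    ce = F.cross_entropy(student_logits, labels)
    # soft-label term: student vs. teacher, softened by the temperature
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kl
```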

Example

Below is an example of how to quantize an NLP model with Intel® Neural Compressor.
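The following is a minimal sketch of such a flow, assuming the neural_compressor 2.x Python API and a Hugging Face transformers model; the checkpoint path, `calib_dataloader`, and `evaluate_on_task` are user-provided placeholders, and the exact API may differ from the version used here.

```python
# Minimal sketch: post-training quantization of an NLP model with
# Intel Neural Compressor (assumes the 2.x Python API and Hugging Face transformers).
from transformers import AutoModelForSequenceClassification
from neural_compressor import PostTrainingQuantConfig, quantization

model = AutoModelForSequenceClassification.from_pretrained(
    "path/to/finetuned-mrpc-model"   # placeholder for a fine-tuned checkpoint
).eval()

def eval_func(model_to_eval):
    # user-defined evaluation returning a single accuracy number;
    # accuracy-aware tuning compares it against the FP32 baseline
    return evaluate_on_task(model_to_eval)   # hypothetical helper

conf = PostTrainingQuantConfig(approach="static")
q_model = quantization.fit(
    model,
    conf,
    calib_dataloader=calib_dataloader,   # user-provided calibration data
    eval_func=eval_func,
)
q_model.save("./quantized_model")
```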

Note that the generated mixed-precision model may vary, depending on the capabilities of the low-precision kernels and the underlying hardware (e.g., an INT8/BF16/FP32 mixed-precision model on 3rd Gen Intel® Xeon® Scalable Processors).

Performance Results

Intel® Neural Compressor has validated 400+ examples with a performance speedup geomean of 2.2x on an Intel® Xeon® Platinum 8380 Processor with minimal accuracy loss. More details for validated models are available here.

*Test configuration: Tested by Intel as of 6/10/2022. Processor: 2S Intel® Xeon® Platinum 8380 CPU @ 2.30GHz, 40-core/80-thread, Turbo Boost on, Hyper-Threading on; Memory: 256GB (16x16GB DDR4 3200MT/s); Storage: 1x Intel® SSD; NIC: 2x Ethernet Controller 10G X550T; BIOS: SE5C6200.86B.0022.D64.2105220049 (ucode: 0xd0002b1); OS: Ubuntu 20.04.1 LTS; Kernel: 5.4.0-42-generic; Batch Size: 1; Cores per Instance: 4.

Summary and Future Work

The vision of Intel® Neural Compressor is to improve productivity and address accuracy loss through an auto-tuning mechanism and an easy-to-use API when applying popular neural network compression approaches. We are continuously improving this tool by adding more compression recipes and combining those techniques to produce optimal models. We invite users to try Intel® Neural Compressor and send us feedback through GitHub issues. We also welcome contributions to the Intel® Neural Compressor GitHub repo.

Acknowledgment

We would like to thank Xin He, Chang Wang, Wenxin Zhang, Penghui Cheng, and Suyue Chen for their contributions to Intel® Neural Compressor. We also offer a special thanks to Eric Lin, Jianhui Li, and Jiong Gong for their technical discussions and insights, and collaborators from Meta for their professional support and guidance. Finally, we would like to thank Wei Li, Andres Rodriguez, and Honesty Young for their great support.
