Optimization for BERT Inference Performance on CPU

Published in Apache MXNet · Aug 8, 2019

Author: Shufan Wu, Tao Lv, Pengxin Yuan, Patric Zhao, Jason Ye, Haibin Lin

Updated on 12 September 2019.

Introduction

With its publication in October 2018, BERT (Bidirectional Encoder Representations from Transformers) [1] refreshed the state-of-the-art results on eleven natural language processing tasks and has grown in popularity for a wide range of applications such as language understanding and question answering.

BERT inherits the skeleton of the Transformer [2] and introduces a multi-layer bidirectional Transformer encoder. BERT is designed to pre-train deep bidirectional representations from unlabeled text, and it became the first fine-tuning-based representation model to achieve state-of-the-art performance on a suite of sentence-level and token-level tasks, thanks to two innovative pre-training tasks: the masked language model (MLM) and next sentence prediction. After pre-training, the network is fine-tuned on specific tasks with minimal changes to the model architecture.

GluonNLP [4] is a toolkit built on the MXNet [3] Gluon API that simplifies text pre-processing, dataset loading and neural network construction, speeding up NLP research and development without compromising on performance. GluonNLP 0.7.1 [5] ships a BERT model pre-trained on 60 GB of text data; the accuracy of this BASE variant is comparable with the BERT LARGE model released by Google. Of particular note, the release also includes 18 other converted models (BERT variants and GPT-2), together with scripts for fine-tuning and text generation; users can launch multiple fine-tuning tasks with these scripts.
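For readers new to GluonNLP, the pre-trained BERT models described above are available through its model zoo. The snippet below is a minimal, illustrative sketch of loading the BERT BASE encoder; the model and dataset names follow the public GluonNLP model zoo naming and are not tied to the specific fine-tuning scripts used for the measurements in this post.

import mxnet as mx
import gluonnlp as nlp

# Load the pre-trained BERT BASE encoder from the GluonNLP model zoo.
# 'bert_12_768_12' / 'book_corpus_wiki_en_uncased' are standard model-zoo
# names; choose the variant you actually want to benchmark.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='book_corpus_wiki_en_uncased',
    pretrained=True,
    ctx=mx.cpu(),
    use_pooler=True,
    use_decoder=False,
    use_classifier=False)
print(bert)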

Apache MXNet 1.5.0 introduces a number of performance optimizations that lead to considerable inference performance gains on BERT. On an AWS EC2 C5.24xlarge instance [6], with GluonNLP v0.7.1 and the MRPC task, we measure up to ~12.6x speed-up on inference throughput and ~2.3x speed-up on inference latency (in the low-core configuration (1)). Furthermore, low precision inference with a quantized BERT model achieves a further ~1.8x improvement on inference latency by leveraging the Vector Neural Network Instructions (VNNI) [7] of 2nd generation Intel Xeon Scalable processors.

Acceleration

Diving into the model architecture of BERT, the Transformer encoder contributes the most to the computational workload. Further analysis of the Multi-Head Attention structure shows that this computational load can be broken down into three parts:

● Multiple GEMMs for the Keys, Queries, Values and the Feed-Forward layers;

● Two normalization layers;

● One GELU [8] activation; in particular, the GELU computation shows up as the erf operator (a minimal illustration follows this list).
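As a reference for the last item, the GELU activation can be expressed through erf, which is why the erf operator appears in the profile. A minimal sketch for illustration only, not the exact kernel inside MXNet:

import math
import mxnet as mx

def gelu(x):
    # GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + mx.nd.erf(x / math.sqrt(2.0)))

x = mx.nd.random.normal(shape=(2, 4))
print(gelu(x))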

To keep the optimizations maximally reusable and scalable, they do not rely on operator fusions targeting the specific layer structure of Multi-Head Attention; instead, the performance improvement comes directly from optimizing the GEMM, normalization and erf computations. Naturally, neural networks with similar structures, such as other attention variants, can also benefit from these optimizations.

Figure 1 Scaled Dot-Product Attention and Multi-Head Attention (image source here)

Taking the classification task on the MRPC dataset as an example and comparing the Apache MXNet 1.4.1 native CPU build against the Apache MXNet 1.5.0 build with the BERT optimizations, the GEMM performance (reflected by the FullyConnected operator) is ~1.6x faster, the normalization layer is ~8.1x faster, and erf is ~26.2x faster on average (2). Table 1 lists the comparison of the corresponding operator time cost per call.

Table 1 Average Operator Time per Call Comparison (lower is better)
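The per-operator numbers above are collected with the MXNet Profiler (see note (2)). Below is a minimal sketch of how such per-call statistics can be gathered; run_inference() is a placeholder for whatever forward passes are being measured.

import mxnet as mx

# Collect per-operator timings around an inference run; with
# aggregate_stats=True the dump reports average time per call for each
# operator (FullyConnected, LayerNorm, erf, ...).
mx.profiler.set_config(profile_all=True, aggregate_stats=True,
                       filename='bert_profile.json')
mx.profiler.set_state('run')

run_inference()   # placeholder: one or more BERT forward passes
mx.nd.waitall()   # wait until all asynchronous work has finished

mx.profiler.set_state('stop')
print(mx.profiler.dumps())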

Measured at the network level against a baseline of the Apache MXNet 1.4.1 native CPU build, the Apache MXNet 1.5.0 build with the BERT optimizations gains up to ~12.6x inference throughput speed-up and ~2.6x latency (3) speed-up across various NLP tasks. Table 2 lists the performance comparisons on classification and QA tasks (4). The performance measured on Apache MXNet 1.4.1 is set as the baseline, and the numbers for MXNet-mkl 1.4.1 and MXNet-mkl 1.5.0 with the BERT optimizations are reported as relative gains. All performance numbers are measured in FP32.

Table 2 Gains on Inference Performance of Classification and QA Tasks
Figure 2 Inference Throughput and Latency Comparison on Classification and QA Tasks

Following requests from users, we also measured real-time inference performance in a “low-core” configuration, which reflects a scenario commonly found in production environments. For short sentences (5) on an AWS EC2 C5.2xlarge instance with 4 physical cores, the latency meets the production requirements reported by users. Table 3 lists the relative gains on inference latency with MXNet-mkl 1.5.0 with the BERT optimizations on the MRPC task, with inputs truncated to short lengths.

Table 3 Real-time Inference Latency on Low-core instance (C5.2xlarge)

Deployment with Intel® DL Boost

To utilize the VNNI capabilities that are part of the Intel Deep Learning Boost [9] acceleration feature set, we quantized the static BERT model to enable a low precision inference path. At present, by replacing selected FullyConnected layers with their quantized counterparts, the model size decreases substantially and the inference latency improves even further. On an AWS EC2 C5.24xlarge instance with the low-core configuration, the partially-quantized BERT model shrinks to 186 MB. With the quantized model, we achieve up to ~1.8x (6) latency improvement over the static FP32 BERT model with a reasonable accuracy loss. Our ongoing work includes quantizing more operators and tuning accuracy; we expect to see even more latency improvements.
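For orientation, the snippet below is a generic sketch of MXNet’s post-training quantization API (mxnet.contrib.quantization.quantize_model) applied to a static symbolic model. The checkpoint names are hypothetical, and the exact operator selection and calibration used for the BERT results here rely on not-yet-released code (see note (7)), so treat this only as an illustration of the general flow.

import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load a static (symbolic) FP32 checkpoint; 'bert_mrpc' is a hypothetical prefix.
sym, arg_params, aux_params = mx.model.load_checkpoint('bert_mrpc', 0)

# Quantize selected operators to INT8; layers listed in excluded_sym_names
# stay in FP32.  Calibration data can be supplied via calib_mode='naive'
# or 'entropy' to reduce the accuracy loss.
qsym, qarg_params, aux_params = quantize_model(
    sym=sym, arg_params=arg_params, aux_params=aux_params,
    ctx=mx.cpu(),
    excluded_sym_names=None,
    calib_mode='none',
    quantized_dtype='int8')

mx.model.save_checkpoint('bert_mrpc_int8', 0, qsym, qarg_params, aux_params)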

Table 4 lists the comparison of inference latency between the quantized and FP32 BERT models achieved at this stage (7).

Table 4 Inference Latency Comparison on FP32 and INT8 Quantized BERT

Acknowledgement

We thank the Apache MXNet community, the Amazon MXNet team and the Intel MKL-DNN team for their great support, and especially Sheng Zha, Xinyu Chen, Ciyong Chen, Thom Lane, Vishaal Kapoor, Emily Hutson and Indu Kalyanaraman for their help and valuable suggestions.

Appendix

Notes:

(1) Low-core refers to using 4 or fewer physical cores; this simulates the configuration of a real production environment;

(2) The average operator time cost per call is measured with the MXNet Profiler;

(3) Inference throughput is measured on 24 physical cores (all the cores on NUMA node 0 of a C5.24xlarge instance); latency is measured on 4 physical cores of NUMA node 0;

(4) For the MRPC task, the input sentences are padded to a max length of 128; for the SQuAD 1.1 task, the max length of the input sentences is 384;

(5) Short sentence refers to sentence lengths of 5, 10, 20 and 50; the input sentences are padded to the corresponding length;

(6) Low-core refers to using 4 or fewer physical cores;

(7) The work on low precision deployment is still ongoing and involves unreleased software; reproduction instructions will be available later.

Step-by-Step Data Reproduction:

Apache MXNET Installation

1) Apache MXNet and Apache MXNet-mkl 1.4.1 can be installed with pip:

pip(3) install mxnet==1.4.1
pip(3) install mxnet-mkl==1.4.1

2) Apache MXNet 1.5.0 with the BERT optimizations needs to be built from source:

make -j USE_MKLDNN=1 USE_BLAS=mkl USE_INTEL_PATH=<YOUR MKL PATH>

3) Install GluonNLP

pip(3) install gluonnlp==0.7.1

Get the inference scripts for the BERT-based NLP tasks from https://github.com/dmlc/gluon-nlp/tree/master/scripts/bert

Running MRPC classification inference on AWS EC2 C5.24xlarge:

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=24
numactl --physcpubind=0-23 --membind=0 python finetune_classifier.py \
--task_name MRPC \
--pad \
--max_len 128 \
--only_inference \
--model_parameters PARAMS \
--dev_batch_size 1

For throughput:

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=24
numactl --physcpubind=0-23 --membind=0 python finetune_classifier.py \
--task_name MRPC \
--pad \
--max_len 128 \
--only_inference \
--model_parameters PARAMS \
--dev_batch_size 32

Running SQuAD 1.1 QA inference:

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=24
numactl --physcpubind=0-23 --membind=0 python finetune_squad.py \
--only_predict \
--model_parameters PARAMS \
--test_batch_size 1

For throughput:

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=24
numactl --physcpubind=0-23 --membind=0 python finetune_squad.py \
--only_predict \
--model_parameters PARAMS \
--test_batch_size 24

For latency with short sequence length on AWS EC2 C5.2xlarge:

export KMP_AFFINITY=granularity=fine,noduplicates,compact,1,0
export OMP_NUM_THREADS=4
# Change --max_len to 5, 10, 20, 50 accordingly
numactl --physcpubind=0-3 --membind=0 python finetune_classifier.py \
--task_name MRPC \
--pad \
--max_len 5 \
--only_inference \
--model_parameters PARAMS \
--dev_batch_size 1

Note: Please replace PARAMS in the command lines above with the path to the corresponding fine-tuned parameter file.

References

[1] Devlin, Jacob, et al. “BERT: Pre-training of deep bidirectional transformers for language understanding.” arXiv preprint arXiv:1810.04805, 2018.

[2] Vaswani, Ashish, et al. “Attention is all you need.” arXiv preprint arXiv:1706.03762, 2017.

[3] Chen, Tianqi, et al. “MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems.” arXiv preprint arXiv:1512.01274, 2015.

[4] Guo, Jian, et al. “GluonCV and GluonNLP: Deep Learning in Computer Vision and Natural Language Processing.” arXiv preprint arXiv:1907.04433, 2019.

[5] Munshi, Faramarz A. “GluonNLP v0.7.1 — BERT Reloaded.” https://medium.com/apache-mxnet/gluonnlp-v0-7-1-bert-reloaded-7b9450d33f4b, 2019.

[6] Simon, Julien. “Now Available: New C5 instance sizes and bare metal instances.” https://aws.amazon.com/cn/blogs/aws/now-available-new-c5-instance-sizes-and-bare-metal-instances/, 2019.

[7] Nagasundaram, Banu. “Vector Neural Network Instructions Enable Int8 AI Inference on Intel Architecture.” https://www.intel.ai/vnni-enables-inference/, 2019.

[8] Hendrycks, Dan, and Kevin Gimpel. “Gaussian Error Linear Units (GELUs).” arXiv preprint arXiv:1606.08415, 2016.

[9] Abidi, Huma. “Increasing AI Performance and Efficiency with Intel® DL Boost.” https://www.intel.ai/increasing-ai-performance-intel-dlboost/, 2019.

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of 1 August 2019 by AWS and Intel and may not reflect all publicly available security updates. No product or component can be absolutely secure. Test configuration: Software: Apache MXNet 1.5.0, MXNet 1.4.1, MXNet-mkl 1.4.1 and GluonNLP 0.7.1. Hardware: AWS EC2 C5.24xlarge: 2nd generation Intel Xeon Scalable processors (Cascade Lake); AWS EC2 C5.2xlarge: 1st generation Intel Xeon Scalable processors (Skylake).

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No product or component can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

Intel does not control or audit third-party data. You should review this content, consult other sources, and confirm whether referenced data are accurate.

Intel, the Intel logo, and Intel Xeon are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. © Intel Corporation
