Natural Language Processing using BERT and OpenVINO™ toolkit

Natural Language Processing (NLP) solutions have changed the way we build applications that model how we share information through speech and language.

OpenVINO™ toolkit · Feb 19, 2021


This post was originally published on Intel.com.

Key Takeaways:

  • Learn how to deliver Natural Language Processing using BERT and the latest release of the Intel® Distribution of OpenVINO™ toolkit, with fine-tuning optimizations that achieve greater performance while maintaining accuracy.
  • Leverage open-sourced fine-tuning recipes and examples of applying quantization to language models.
  • Get started quickly running pre-optimized and open-sourced models using BERT.

Get started today.

Author: Maxim Shevtsov, Senior Deep Learning Engineer, Architecture, Graphics and Software Group

Other Contributors: Yury Gorbachev, Sr. Principal Engineer, Internet of Things Group; Zhenlin Luo, Senior Deep Learning Engineer, Data Platform Group; Pallavi G, Software Engineer, Data Platform Group; Konstantin Rodyushkin, Deep Learning R&D Engineer, Internet of Things Group; Vasily Shamporov, Deep Learning R&D Engineer, Internet of Things Group

Introduction

Natural Language Processing (NLP) solutions have changed the way we build applications that model how we share information through speech and language. With the advent of deep learning and the introduction of cutting-edge neural networks for NLP, new use cases have sprouted, from sentiment analysis to customer support, spelling and grammar correction, and receipt scanning for tracking finances.

Since its original release, Bidirectional Encoder Representations from Transformers (BERT) has been one of the most popular language models. More recently, numerous BERT variants have appeared: some, like RoBERTa and XLNet, refine the training process to improve the resulting accuracy, while others, like DistilBERT, focus on improving inference speed.

Though these models differ in design, they share the same network architecture of stacked Transformer blocks. We cover how the latest release, the Intel® Distribution of OpenVINO™ toolkit 2020.4, optimizes these Transformer building blocks and enables low-precision inference to achieve significant performance gains on any BERT-based Natural Language Processing task.

With the Intel® Distribution of OpenVINO™ toolkit, you can convert BERT models trained in major frameworks, optionally fine-tune them for lower precision, and finally deploy them. What’s more, the code for the optimizations within the OpenVINO™ toolkit, now practical enough to take BERT into production, is open-sourced.

Finally, we will discuss a simple question answering (QA) demo programmed in Python, powered by BERT optimized using the Intel® Distribution of OpenVINO™ toolkit.

BERT Quantization with the Intel® Distribution of OpenVINO™ toolkit

Transformer-based language models, such as BERT, contain a large number of parameters and computations. The emergence of even larger and more complex models in pursuit of better accuracy suggests a trend toward using quantized models in production environments.

Quantization refers to techniques for performing computations and storing data at a lower bit width than floating-point precision. The OpenVINO™ toolkit supports INT8 quantization, allowing up to a 4x reduction in model size and up to 4x faster execution compared to the original FP32 model across various models. See configuration information.
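To make this concrete, here is a minimal, generic sketch of symmetric INT8 quantization arithmetic. It is illustrative only, not the OpenVINO™ toolkit’s internal implementation:

```python
import numpy as np

def quantize_int8(x):
    # Symmetric per-tensor quantization: map FP32 values onto the
    # [-127, 127] integer grid with a single scale factor.
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an FP32 approximation; the rounding error is the price
    # paid for 4x smaller storage (int8 vs. float32).
    return q.astype(np.float32) * scale

x = np.random.randn(768).astype(np.float32)
q, scale = quantize_int8(x)
print("max abs error:", np.abs(dequantize(q, scale) - x).max())
```

The INT8 tensor occupies a quarter of the FP32 storage; the rounding error introduced here is exactly what the fine-tuning techniques below are designed to keep small.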

OpenVINO™ toolkit supports multiple approaches to quantizing a deep learning network:

  • In most cases, it is possible to calibrate a model trained in FP32 to INT8 using the Post-training Optimization Tool.
  • Quantization-aware training, in contrast, models quantization errors in both the forward and backward passes using fake-quantization modules. This often achieves more accurate and performant results than the post-training calibration above; a minimal sketch of the idea follows this list.
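As a rough illustration of the fake-quantization modules mentioned above, here is a minimal PyTorch sketch using a straight-through estimator; the actual implementation in our open-sourced fine-tuning recipes differs:

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    # Fake-quantization with a straight-through estimator (STE):
    # the forward pass rounds values to the INT8 grid so the network
    # "feels" the quantization error; the backward pass lets gradients
    # through unchanged so training still works.
    @staticmethod
    def forward(ctx, x, num_bits=8):
        qmax = 2 ** (num_bits - 1) - 1                    # 127 for INT8
        scale = x.detach().abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(x / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None                          # identity gradient (STE)

x = torch.randn(4, 8, requires_grad=True)
FakeQuantSTE.apply(x).sum().backward()                    # gradients flow despite rounding
```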

Specifically for BERT, we’ve open-sourced our complete fine-tuning recipe for HuggingFace models on PyTorch. Notice that in this particular example, the fine-tuning (i.e., on the Stanford Question Answering Dataset, or SQuAD) is efficiently coupled with quantization. We also encourage you to try an example calibration configuration for the TensorFlow models.

Both flows keep the accuracy drop well under 1% compared to the original FP32 model. We also encourage others to reproduce the results from our published performance benchmarks; to do so, you can use the open-sourced benchmark tool on GitHub. See configuration information.
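For a quick sanity check before reaching for the full benchmark tool, a rough latency measurement with the Inference Engine Python API might look like the following sketch (file names are placeholders; the real benchmark_app adds warm-up, asynchronous inference streams, and proper statistics):

```python
import time
import numpy as np
from openvino.inference_engine import IECore

# Placeholder IR file names; real BERT inputs are token/segment/mask IDs.
ie = IECore()
net = ie.read_network(model="bert_squad.xml", weights="bert_squad.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

# Zero-filled dummy inputs matching each declared input shape.
inputs = {name: np.zeros(info.input_data.shape, dtype=np.int32)
          for name, info in net.input_info.items()}

start = time.perf_counter()
for _ in range(100):
    exec_net.infer(inputs=inputs)
print("avg latency: %.2f ms" % ((time.perf_counter() - start) / 100 * 1e3))
```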

Runtime Optimizations

Transformer models, such as BERT, feature graphs with many layers. The OpenVINO™ toolkit fuses key sub-graphs of multiple elementary operations (e.g., Power, Divide) into single kernels for both CPU and GPU, including the LayerNormalization and GELU layers. Doing so significantly reduces the memory copies between the numerous elementary computations.
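To see why this fusion matters, here is a numpy sketch (illustrative, not OpenVINO kernel code) of the elementary-operation pattern that the toolkit collapses into a single LayerNormalization kernel:

```python
import numpy as np

def layer_norm_decomposed(x, gamma, beta, eps=1e-12):
    # The chain of elementary operations (ReduceMean, Subtract, Power,
    # Divide, Multiply, Add) that the fusion pass collapses into one
    # LayerNormalization kernel. Left unfused, each step would
    # materialize a full intermediate tensor in memory.
    mean = x.mean(axis=-1, keepdims=True)
    var = ((x - mean) ** 2).mean(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta
```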

Additionally, recognizing GELU as a specific operation (i.e., an activation) allows full fusion of the typical Transformer block pattern, which is particularly important for efficient INT8 execution.

Specifically, the most time-consuming sequence from the Transformer block is as follows:

… -> FakeQuantize -> MatMul -> (Bias)Add -> GELU -> FakeQuantize -> …

In the OpenVINO™ toolkit, this sequence (assuming the second input to the MatMul, which is not shown, is also an integer, typically the weights data) is fused into a single INT8 MatMul (with biases), with the GELU activation applied as a free “post-op”. In the original sequence, the first FakeQuantize defines the INT8 nature of the (first) input, and the last defines the INT8 nature of the output. Notice that both FakeQuantize operations are inserted during the fine-tuning process by the techniques from the previous section. Learn more in the Developer’s Guide on INT8 Inference with the OpenVINO™ toolkit.
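As an illustration of what the fused kernel computes, here is a numpy sketch of the arithmetic (not the actual optimized kernel), assuming hypothetical scale factors recorded by the two FakeQuantize operations:

```python
import math
import numpy as np

erf = np.vectorize(math.erf)

def gelu(x):
    # Exact GELU: 0.5 * x * (1 + erf(x / sqrt(2)))
    return 0.5 * x * (1.0 + erf(x / math.sqrt(2.0)))

def fused_int8_matmul_gelu(x_q, w_q, bias, x_scale, w_scale):
    # x_q, w_q: INT8 tensors produced by the surrounding FakeQuantize ops;
    # x_scale, w_scale: their (hypothetical) dequantization scale factors.
    # Accumulate in INT32, dequantize once with the combined scale, then
    # apply GELU on the accumulator output as a "free" post-op.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)
    return gelu(acc.astype(np.float32) * (x_scale * w_scale) + bias)
```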

Try the Pre-trained Models Yourself

For your convenience, several pre-trained and pre-optimized BERT models are included in the OpenVINO™ toolkit’s Open Model Zoo.

Finally, a knowledge-distillation recipe that condenses a larger (FP32) model into a smaller (also FP32) BERT model, without any significant loss in accuracy while significantly reducing the computation cost, is available on GitHub.
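For reference, the classic distillation loss (Hinton et al.) that such recipes are typically built around looks like the sketch below; treat it as a generic formulation rather than the exact loss used in the repository:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Classic knowledge-distillation loss: the smaller student learns to
    # match the larger teacher's softened output distribution.
    t = temperature
    log_student = F.log_softmax(student_logits / t, dim=-1)
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * (t * t)
```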

New OpenVINO™ toolkit Demo for Question-Answering

Question Answering (QA) is a very popular way to test a BERT model’s ability to understand context. Specifically, for QA, BERT is fine-tuned with additional task-specific layers on SQuAD.

The Open Model Zoo repository now comes with a BERT Question Answering Python Demo that takes passages (e.g., from a URL) and questions as input, and returns responses generated by the BERT model. An example conversation with the demo uses the Wikipedia entry for the Bert Muppet character from Sesame Street [1].

Notice that the demo by no means replaces OpenVINO’s accuracy_checker, nor the OpenVINO benchmark_app (with respect to performance). Refer to the demo README at https://github.com/opencv/open_model_zoo/tree/develop/demos/python_demos/bert_question_answering_demo for further details.

The demo is fully compatible with the pre-trained models from the previous section, along with other BERT-based models that have been fine-tuned on SQuAD.
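To show what such a demo does under the hood, here is a heavily simplified sketch of SQuAD-style QA inference with the Inference Engine Python API. The file names are placeholders, tokenize() is a hypothetical helper, and the actual tensor names depend on the converted model:

```python
import numpy as np
from openvino.inference_engine import IECore

# File names are placeholders, tokenize() is a hypothetical helper, and
# tensor names ("input_ids", "output_s", ...) vary by converted model.
ie = IECore()
net = ie.read_network(model="bert_squad.xml", weights="bert_squad.bin")
exec_net = ie.load_network(network=net, device_name="CPU")

passage = "..."   # e.g., text fetched from the Wikipedia URL
question = "Who is Bert's best friend?"
input_ids, attention_mask, token_type_ids = tokenize(question, passage)

outputs = exec_net.infer(inputs={
    "input_ids": input_ids,
    "attention_mask": attention_mask,
    "token_type_ids": token_type_ids,
})

# A SQuAD-style head emits per-token logits for the start and end of the
# answer span; the answer is the token range between the two argmaxes.
start_logits, end_logits = outputs["output_s"], outputs["output_e"]
start, end = int(np.argmax(start_logits)), int(np.argmax(end_logits))
answer_tokens = input_ids[0][start:end + 1]
```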

Conclusions

Developers can now deliver Natural Language Processing using BERT and the latest release of the Intel® Distribution of OpenVINO™ toolkit. Specifically, the INT8 optimizations come with only marginally lower accuracy (compared to single precision) while substantially speeding up inference.

We hope that these optimizations will solve challenges for developers in using BERT as part of applications, while also more generally accelerating BERT deployments within the NLP community.

The OpenVINO™ toolkit code with all the scripts and recipes for BERT can be found on GitHub. We encourage you to try the Intel® Distribution of OpenVINO™ toolkit today, and join the conversation!

References

[1] Bert Muppet character from Sesame Street

Notices and Disclaimers

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.

Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more complete information visit www.intel.com/benchmarks.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure.

Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Your costs and results may vary.

Intel does not control or audit third-party data. You should consult other sources to evaluate accuracy.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
