The Need for Speed: Accelerating NLP Inferencing in Spark NLP with OpenVINO™ Runtime

Rajat
Published in OpenVINO-toolkit
7 min read · Nov 9, 2023

The Growing Importance of Optimized NLP Model Serving

Natural Language Processing (NLP) is an important application of AI that enables machines to understand, interpret, and derive insights from natural-language text. In recent years, large pre-trained language models (LLMs) such as BERT have driven breakthrough improvements in key NLP applications, including machine translation and information extraction. The rapid adoption of LLMs is transforming what is possible with NLP across many important domains.

However, serving these complex models comes with significant computational overhead. Massive transformer-based models with many millions of parameters can lead to slow inference speeds and insufficient throughput for production systems. This blog post provides a concise overview of how to improve inference performance when running NLP models in Spark NLP by taking advantage of the OpenVINO toolkit. Spark NLP is an enterprise-grade NLP library enabling end-to-end NLP pipelines. By switching to the OpenVINO backend in Spark NLP using a simple API call, users can take advantage of optimized inference execution on Intel hardware. Benchmark results demonstrate OpenVINO can deliver up to 40% faster inference compared to the default TensorFlow backend without any tuning. By combining the scalability of Spark NLP and the optimization capabilities of OpenVINO Runtime, users can build high-performance NLP applications that meet the demands of real-world deployments.

What is Spark NLP?

Spark NLP is a Natural Language Processing library built on top of Apache Spark, an open-source analytics engine optimized for large-scale data processing. This allows Spark NLP to deliver the computational speed and scalability required for industrial NLP applications. The library provides a unified solution to build end-to-end NLP workflows, from transforming raw text into structured features to applying accurate NLP annotations that integrate seamlessly into downstream ML pipelines. Users can leverage Spark NLP’s simple yet powerful API for adding high-performance NLP capabilities to their applications.

Spark NLP provides immediate access to over 17,000 pre-trained pipelines and models covering more than 200 languages. With the ability to import custom models, the newly introduced ONNX support, and the optimization features available during model export through libraries like onnxruntime and optimum, Spark NLP already offers substantial performance and flexibility when serving LLMs.
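For instance, pulling down and running one of these pre-trained pipelines takes only a couple of lines of Scala. The sketch below is illustrative: the pipeline name explain_document_dl is one of the publicly available English pipelines, and any other name from the Models Hub can be substituted.

import com.johnsnowlabs.nlp.pretrained.PretrainedPipeline

// Download a publicly available English pipeline by name and run it on raw text.
val pipeline = PretrainedPipeline("explain_document_dl", lang = "en")
val annotations = pipeline.annotate("Spark NLP ships thousands of ready-to-use models.")
// Inspect one of the output columns produced by the pipeline (e.g. recognized entities).
println(annotations("entities"))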

Faster Inference with OpenVINO™ Runtime

OpenVINO Runtime is a set of C++ libraries with officially supported C and Python bindings, plus an extra module that provides a Java API, delivering an inference solution that supports a range of model frameworks and platforms. Under the hood, OpenVINO uses the oneAPI Deep Neural Network Library (oneDNN) to accelerate deep learning workloads. oneDNN is an open-source, cross-platform performance library that provides basic building blocks for deep learning applications. These building blocks include highly optimized kernels which, combined with OpenVINO's own optimized kernels and several graph optimization techniques, exploit the underlying hardware architecture to extract maximum CPU and GPU performance for even the most demanding workloads. A list of supported inference devices can be found here.

OpenVINO can directly read models in the ONNX, PaddlePaddle, TensorFlow, and TensorFlow Lite formats without any prior conversion step. The following figure shows a typical workflow for deploying a trained deep-learning model for inference:

Source: OpenVINO™ Runtime User Guide
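To make the Java API concrete, the following is a rough Scala sketch of loading and compiling a model through the OpenVINO Java bindings, which are the same bindings the Spark NLP integration builds on. The class and method names follow the Java API published in the openvino_contrib repository, and the model path and device name are placeholders:

import org.intel.openvino.Core

// Read a model (OpenVINO IR, ONNX, TensorFlow, ...) and compile it for a target device.
val core = new Core()
val model = core.read_model("/path/to/model.xml")
val compiledModel = core.compile_model(model, "CPU")
val inferRequest = compiledModel.create_infer_request()
// ... set the input tensors, call inferRequest.infer(), then read the output tensors ...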

If you have a model trained in a different framework, such as PyTorch, or need more optimization options, additional tools are available to convert these models into the OpenVINO Intermediate Representation (IR) format. OpenVINO IR is OpenVINO's own model format, produced by converting a model with the Model Conversion API. This representation consists of two files:

  • .xml file: describes the model topology
  • .bin file: contains the weights and binary data

With OpenVINO, you can convert a model for inference either with the ovc command-line tool, a cross-platform utility that eases the transition between training and deployment environments, or directly from the source model object using the Python API. You can also compress and quantize your models to the optimal precision for the best performance on your preferred devices and a reduced memory footprint. Combining the tools offered by the OpenVINO ecosystem lets you optimize both custom and ready-to-use pre-trained networks to suit your deployment needs.

Preparing your model with OpenVINO

Leveraging OpenVINO™ in Spark NLP

In Spark NLP, models are represented as Annotators that run inference on your input DataFrame and append the results to it. You can import a custom deep-learning model into the equivalent annotator using the loadSavedModel function provided by the annotator's companion object. Given the path to the exported model directory (which also contains any required assets) and the active Spark session (the entry point into Spark functionality, including getting and setting Spark configurations and creating DataFrames from data sources), this function returns the equivalent Spark NLP annotator, which can then be used as part of an NLP pipeline.

Note: loadSavedModel accepts local paths as well as distributed file systems such as HDFS, S3 and DBFS. This feature was introduced in the Spark NLP 4.2.2 release.
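For reference, the Spark session argument can be obtained with Spark NLP's convenience starter (a minimal sketch; an existing SparkSession built through SparkSession.builder works just as well):

import com.johnsnowlabs.nlp.SparkNLP

// Start (or reuse) a Spark session with Spark NLP available on the classpath.
val spark = SparkNLP.start()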

Spark NLP uses TensorFlow and, from version 5.0, ONNX Runtime to run the underlying deep-learning models. By default, the source model framework is identified automatically and the corresponding TensorFlow or ONNX backend is invoked. To switch to the OpenVINO backend, simply pass the useOpenvino flag in the function call. For example, to load a custom BERT model into the BertEmbeddings annotator (a simple annotator that produces word embeddings using the BERT model), all you need is the following Spark statement:

val embeddings = BertEmbeddings.loadSavedModel(MODEL_PATH, spark, useOpenvino = true)

The MODEL_PATH represents the exported model folder, typically of the following structure:

MODEL_PATH
├── assets
│   ├── your-assets-are-here
└── model-files

This converts the source model into the OpenVINO IR format using the read_model + serialize flow and loads it as an OpenVINO CompiledModel. If the model is already in the OpenVINO IR format, it is imported directly and no conversion is performed. Note that none of the source framework dependencies need to be installed for this. The OpenVINO Java bindings come bundled as a lightweight (~50 MB, CPU-only) package containing the necessary wrappers to load models in various formats and execute optimized inference.
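Putting it together, a minimal end-to-end pipeline around the OpenVINO-backed annotator might look like the sketch below. It assumes MODEL_PATH and spark are defined as described above; the column names and the sample sentence are illustrative.

import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.Tokenizer
import com.johnsnowlabs.nlp.embeddings.BertEmbeddings
import org.apache.spark.ml.Pipeline
import spark.implicits._

// Split raw text into documents and tokens, then embed the tokens with the
// custom BERT model loaded through the OpenVINO backend.
val documentAssembler = new DocumentAssembler()
  .setInputCol("text")
  .setOutputCol("document")

val tokenizer = new Tokenizer()
  .setInputCols("document")
  .setOutputCol("token")

val embeddings = BertEmbeddings
  .loadSavedModel(MODEL_PATH, spark, useOpenvino = true)
  .setInputCols("document", "token")
  .setOutputCol("embeddings")

val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer, embeddings))

val data = Seq("OpenVINO speeds up BERT inference in Spark NLP.").toDF("text")
val result = pipeline.fit(data).transform(data)

// Each row now carries one embedding vector per token.
result.selectExpr("explode(embeddings.embeddings)").show(5)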

Finally, to measure the performance benefits, we ran a few benchmarking experiments with the bert-large-cased model from HuggingFace using the OpenVINO backend with no additional tuning.

Benchmarks

To perform the benchmarking, a simple Scala script that runs a pipeline on a Spark DataFrame and measures the total time taken was executed on an Intel 12th-Gen Core i7-12700H CPU with 14 cores. The eng.testa validation split of the CoNLL 2003 dataset (a named entity recognition dataset available through Spark NLP) was transformed with the BertEmbeddings annotator, and the total latency was measured across varying sequence lengths and batch sizes.
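The measurement can be reproduced along these lines (a simplified sketch rather than the original script, which is not included in the post; the dataset path and parameter values are illustrative, and embeddings is the OpenVINO-backed annotator configured earlier):

import com.johnsnowlabs.nlp.training.CoNLL

// Read the CoNLL 2003 eng.testa split into a DataFrame with document,
// sentence and token columns, then time a full pass of the embeddings annotator.
val testData = CoNLL().readDataset(spark, "/path/to/eng.testa")

val start = System.nanoTime()
embeddings
  .setBatchSize(4)             // repeated with batch sizes 4 and 8
  .setMaxSentenceLength(128)   // repeated for each sequence length under test
  .transform(testData)
  .count()                     // force full evaluation of the lazy DataFrame
val seconds = (System.nanoTime() - start) / 1e9
println(f"Total time: $seconds%.1f s")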

bert-large-cased model, Batch Size 4
bert-large-cased model, Batch Size 8

In these tests, the OpenVINO Runtime delivers up to 40% better performance out of the box, without any additional tuning, compared to the default TensorFlow backend.

Current Integration Status

The OpenVINO and Spark NLP integration described in this post uses the OpenVINO Java API and is currently under review for merging into the main Spark NLP repository. While the benchmarks and results highlighted here are based on the OpenVINO Java API integration, they demonstrate the potential performance benefits of using OpenVINO as a backend for optimized model inference in Spark NLP. The reader is encouraged to check for the latest updates on the progress of the OpenVINO integration by following the corresponding GitHub PR.

You can also refer to these instructions to build OpenVINO, these for Spark NLP, and try out the integration yourself in the following notebooks: Bert, Roberta, XlmRoberta.

Conclusion

With the OpenVINO integration, Spark NLP can support importing models from multiple frameworks through a single engine backend. You also get the full extent of the optimization and quantization capabilities offered by the OpenVINO Toolkit, letting you improve deep learning performance, reduce resource demands, and efficiently deploy your custom models on a range of inference platforms. Make sure to follow this GitHub PR to track the status of the OpenVINO integration into Spark NLP.

Additional Resources:

Notices & Disclaimers:

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex​.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available ​updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
