Introducing OpenVINO 2024.2: Empowering AI Generation with LLM-Specific APIs and Enhanced Serving Capabilities

OpenVINO™ toolkit · Published in OpenVINO-toolkit · Jun 17, 2024

It’s been a few very busy weeks for us, working on improving our product based on your feedback and expanding the ecosystem to cover additional scenarios and use cases.

Let’s review the most important changes that we have made. For a more detailed list, you can always refer to our full [release notes].

Introducing the OpenVINO.GenAI Package and LLM-Specific APIs

Generative AI is being adopted rapidly by application developers. The traditional approach of calling REST APIs backed by models in commercial cloud offerings has been popular for a while, but client and edge use cases are also on the rise. More data is being processed locally, and AI PCs are opening up additional opportunities to do so. One such scenario is AI assistants capable of generating text: mail drafts, document summaries, answers to questions about document content, and more. This is powered by both Large Language Models (LLMs) and a growing family of Small Language Models.

We have introduced a new package, openvino-genai, which uses openvino and openvino_tokenizers underneath, so if you aim to run LLMs, this is the package to install. The classic OpenVINO APIs come with it as a dependency, so other types of models remain supported and building mixed pipelines is easier. Our installation options have been updated to reflect and advise on the new package, so check which option suits you best. The existing OpenVINO inference package is still available, and if you don't plan to use the generative APIs for now, you can simply continue using it.

To produce results with an LLM, an application must execute an entire pipeline of actions: tokenize the input text, process the input context, iteratively generate subsequent output tokens of the model's answer, and finally decode the answer from tokens back to plain text. The generation of each token is an inference call, followed by logic to select the token itself. That logic can take the form of a greedy search, which selects the most probable token, or a beam search, which maintains a few candidate sequences and selects the best one. While OpenVINO shines at inference, inference alone is not enough to cover the entire generation pipeline. Before the 2024.2 release, we provided some helpers for this (tokenizers and samples), but developers had to implement the entire generation logic themselves using those components. This is changing now.
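Before looking at the new API, here is a toy illustration of that token-selection step. This is not OpenVINO code, just plain NumPy showing what greedy search decides on each iteration of the loop; the logits values are made up:

import numpy as np

def greedy_select(logits: np.ndarray) -> int:
    # Greedy search: pick the single most probable next token.
    return int(np.argmax(logits))

# Toy logits for a vocabulary of five tokens, as one inference call might return.
logits = np.array([0.1, 2.3, -0.5, 1.7, 0.0])
print(greedy_select(logits))  # -> 1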

With the 2024.2 release, we are introducing LLM-specific APIs that hide the complexity of generation loops and significantly minimize the amount of code needed for the application to work. With the LLM-specific API, you can load a model, pass a context to it, and get a response back in just a few lines of code. Internally, OpenVINO will tokenize the input text, execute the generation loop on the device you selected, and provide you with an answer. Let's see how this is done, step by step, in both Python and C++.

Step 1. Export an LLM via Hugging Face Optimum-Intel (we use a chat-tuned TinyLlama)

Below are two options for exporting the LLM to OpenVINO IR format, with either FP16 or INT4 weights. To make LLM inference more performant, we recommend using a lower precision for the model weights, i.e. INT4, and compressing the weights with the Neural Network Compression Framework (NNCF) directly during model export, as shown below (the final argument in each command is the output directory for the exported model).

For FP16:

optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format fp16 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"

For INT4:

optimum-cli export openvino --model "TinyLlama/TinyLlama-1.1B-Chat-v1.0" --weight-format int4 --trust-remote-code "TinyLlama-1.1B-Chat-v1.0"
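If you prefer to stay in Python, roughly the same export can be scripted through the Optimum-Intel API. The snippet below is only a sketch, assuming optimum-intel with its OpenVINO extra is installed; the command-line route above is the one our GenAI samples document, as it also takes care of converting the tokenizer/detokenizer:

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
save_dir = "TinyLlama-1.1B-Chat-v1.0"  # illustrative output directory

# Export to OpenVINO IR and compress the weights to INT4 with NNCF in one step.
model = OVModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    quantization_config=OVWeightQuantizationConfig(bits=4),
)
model.save_pretrained(save_dir)
AutoTokenizer.from_pretrained(model_id).save_pretrained(save_dir)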

Step 2. Perform generation using the model in C++ or Python

Via the new C++ API for LLM Generation:

#include "openvino/genai/llm_pipeline.hpp"

#include <iostream>

int main(int argc, char* argv[]) {

std::string model_path = argv[1];

ov::genai::LLMPipeline pipe(model_path, "CPU");//target device is CPU

std::cout << pipe.generate("The Sun is yellow because"); //input context

}

Via the new Python API for Generation:

import openvino_genai as ov_genai

model_path = "TinyLlama-1.1B-Chat-v1.0"  # directory with the exported model from Step 1
pipe = ov_genai.LLMPipeline(model_path, "CPU")
print(pipe.generate("The Sun is yellow because"))

As you can see, it now takes only a few lines of code to build an LLM generation pipeline. Once the model is exported from Hugging Face Optimum-Intel, it already contains all the information necessary for execution, including the tokenizer/detokenizer and the generation config, ensuring that its results match Hugging Face generation. We provide both C++ and Python APIs to run LLMs, with a minimal list of dependencies and additions to your application.
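Generation parameters can be passed to the same call without writing any loop yourself. Here is a small sketch, assuming the export directory name from Step 1 and using max_new_tokens as one example setting:

import openvino_genai as ov_genai

model_path = "TinyLlama-1.1B-Chat-v1.0"  # directory with the exported model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

# Cap the length of the answer; other generation settings can be passed the same way.
print(pipe.generate("The Sun is yellow because", max_new_tokens=100))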

To implement more interactive UIs for generation, we have added support for streaming model output tokens. In the example below, we use a simple lambda function to print words to the console as soon as the model generates them:

#include "openvino/genai/llm_pipeline.hpp"

#include <iostream>

int main(int argc, char* argv[]) {

std::string model_path = argv[1];

ov::genai::LLMPipeline pipe(model_path, "CPU");

auto streamer = [](std::string word) { std::cout << word << std::flush; };

std::cout << pipe.generate("The Sun is yellow because", streamer);

}

It is also possible to create a custom streamer for more sophisticated processing; this is described in our [documentation].
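For reference, the same streaming idea in Python looks roughly like this; a sketch assuming the streamer argument of the Python API, with a simple print-based callback standing in for your own handler:

import openvino_genai as ov_genai

model_path = "TinyLlama-1.1B-Chat-v1.0"  # directory with the exported model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

def streamer(subword):
    # Print each piece of text as soon as the model produces it.
    print(subword, end="", flush=True)
    return False  # False means "keep generating"

pipe.generate("The Sun is yellow because", streamer=streamer)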

Finally, we have also worked on the chat scenario, where inputs and outputs represent a conversation and there is an opportunity for optimization by preserving the KV cache between inputs. For that, we have introduced two chat-specific methods, start_chat and finish_chat, which mark a conversation session. Here is a very simple C++ example:

#include "openvino/genai/llm_pipeline.hpp"
#include <iostream>
#include <string>

int main(int argc, char* argv[]) {
    std::string prompt;
    std::string model_path = argv[1];
    ov::genai::LLMPipeline pipe(model_path, "CPU");

    pipe.start_chat();
    for (;;) {
        std::cout << "question:\n";
        std::getline(std::cin, prompt);
        if (prompt == "Stop!")
            break;
        std::cout << "answer:\n";
        auto answer = pipe(prompt);
        std::cout << answer << std::endl;
    }
    pipe.finish_chat();
}
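A rough Python equivalent of the same chat loop, as a sketch assuming start_chat and finish_chat are exposed the same way in the Python bindings:

import openvino_genai as ov_genai

model_path = "TinyLlama-1.1B-Chat-v1.0"  # directory with the exported model
pipe = ov_genai.LLMPipeline(model_path, "CPU")

pipe.start_chat()  # preserve the KV cache between turns
while True:
    prompt = input("question:\n")
    if prompt == "Stop!":
        break
    print("answer:")
    print(pipe.generate(prompt))
pipe.finish_chat()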

In all the examples above we use the CPU as the target device, but both CPU and GPU can be used to perform LLM inference. The token selection logic and tokenization/detokenization, however, remain on the CPU: tokenizers are represented as a separate model and run on the CPU using our inference capabilities.

This API allows us to have a more flexible and optimized implementation of the generation logic, which we will continue to expand. Stay tuned for more features in coming releases!

Meanwhile, make sure to check our [documentation] and [samples] for the new API, give it a try, and let us know what you think.

Expanding Serving of Models via OpenVINO

Deployment of models via serving is a well-established paradigm and increasingly in demand: more and more applications are built as microservices and deployed not only in the cloud but also at the edge. With the 2024.2 release, we are introducing additional support for serving scenarios. Let's walk through the most important changes.

OpenVINO Model Server (OVMS) is a long-standing solution for model serving and has been widely adopted by applications to serve models efficiently. In this release, we are introducing the ability to serve LLMs with high efficiency via a mechanism called continuous batching.

Essentially, continuous batching lets us serve inference requests efficiently by combining multiple requests into batches. Traditional batching works only in a very limited way for text generation because context sizes differ during generation: it is practically impossible to receive two requests of the same length at the same time that also produce outputs of the same length, which is what traditional request batching would require. To address this, we adopt the Paged Attention approach, as in the vLLM implementation. It allows us to combine multiple requests to the same model and increase hardware utilization. However, our internal logic for scheduling requests is different: we designed it around the specifics of the CPU to be more efficient, combining high throughput with lower latency.

To serve LLMs in the most application-friendly way, we have implemented an OpenAI-compatible API for text generation use cases. Our implementation includes the continuous batching and paged attention algorithms, so text generation stays fast and efficient under high-concurrency load. This allows you to stand up your own OpenAI-like LLM serving endpoint in the cloud or on-premises if required. Check out the [sample] to learn how.
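To illustrate what "OpenAI-compatible" means in practice, a client can talk to such an endpoint roughly as shown below. This is only a sketch: the host, port, URL path, and model name are assumptions for illustration, so check the linked sample for the exact values used by your deployment:

import requests

url = "http://localhost:8000/v3/chat/completions"  # hypothetical OVMS endpoint
payload = {
    "model": "TinyLlama-1.1B-Chat-v1.0",           # name the model was served under
    "messages": [{"role": "user", "content": "The Sun is yellow because"}],
    "max_tokens": 100,
}
response = requests.post(url, json=payload)
print(response.json()["choices"][0]["message"]["content"])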

Despite all the LLM hype, classic deep learning models remain in high demand, both as standalone solutions and as parts of bigger pipelines. OVMS has been deploying those models efficiently for a long time, but demand for other deployment solutions was high enough that we decided to introduce a few additional OpenVINO integrations for serving scenarios: serving via TorchServe* and NVIDIA Triton Inference Server*.

Another important addition to our serving capabilities is serving models with TorchServe using the OpenVINO backend for torch.compile. Right after the introduction of torch.compile, TorchServe added the ability to accelerate serving via different backends, and OpenVINO is now one of the backends that can be specified. For more details, check our examples; they are quite simple and self-explanatory.
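Outside of TorchServe, the same backend can also be used directly from PyTorch. Here is a minimal sketch, assuming openvino and a recent PyTorch are installed and using a small torchvision model as a stand-in for your own:

import torch
import torchvision.models as models
import openvino.torch  # registers the "openvino" backend for torch.compile

model = models.resnet18(weights=None).eval()

# Compile the model so inference runs through OpenVINO.
compiled_model = torch.compile(model, backend="openvino")

with torch.no_grad():
    output = compiled_model(torch.randn(1, 3, 224, 224))
print(output.shape)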

Performance Improvements

The performance of AI models remains our focus. The widespread adoption of LLMs by clients is pushing the limits of the underlying hardware, even with the emergence of the AI PC. Our optimization efforts span various supported targets, including CPU, GPU, and NPU.

AI PCs differ from previous generations of PCs by integrating specialized hardware accelerators, and they are growing in importance as AI use cases shift from the cloud to personal computing. Intel® Core™ Ultra processors offer more powerful GPUs along with NPUs. From the standpoint of performance and efficiency, these are attractive targets for accelerating solutions.

If platform performance is not sufficient, discrete acceleration can be added in the form of Intel® Arc™ discrete GPUs. To help with LLM deployment, we have been focusing on accelerating LLMs on GPUs, covering both integrated and discrete flavors. Offloading to the GPU is done not only for performance but also to keep the CPU available for the rest of the application, so CPU load during inference is critical in such cases. We have been optimizing the CPU-side load and have reduced host-code latency by at least half. This also allowed us to achieve better GPU characteristics, as kernel scheduling is now more efficient.

Additionally, we have worked on more efficient implementations of a few GPU primitives, including a fused version of Scaled Dot Product Attention and Positional Embeddings. This not only improves latency but also reduces host overhead and overall memory consumption during inference, which is critical for scenarios like running LLMs on laptops.

Latency on discrete GPUs has decreased for some LLMs, and we continue our optimization journey together with our partners on the oneDNN team.

While we have talked a lot about GPUs, other targets such as the CPU have improved as well. On CPU, there is a significant improvement in second-token latency and memory footprint of FP16-weight LLMs on AVX2 (13th Gen Intel® Core™ processors) and AVX-512 (3rd Gen Intel® Xeon® Scalable processors) based CPU platforms, particularly for small batch sizes. Coverage of new models in our Optimum-Intel integration also keeps growing.

New Models, Notebooks, and Samples

With each release, we continue to expand support for new models, as well as add new notebooks showing how to leverage OpenVINO in different use cases. For new models, we added support for mil-nce and openimages-v4-ssd-mobilenet-v2 from TensorFlow* Hub, as well as Phi-3-mini, a family of AI models that leverages the power of small language models for faster, more accurate, and more cost-effective text processing.

Notebooks are a valuable way for users to learn and experiment. In this release, we have added several new notebooks. The most notable are the DynamiCrafter notebook for animating images, a notebook for converting and optimizing YOLOv10 for OpenVINO, and the addition of the Phi-3-mini and Qwen2 models to the existing LLM chatbot notebook, so users can play around with even more LLMs.

The full list of notebooks that have been updated or newly added:

· Image to Video Generation with Stable Video Diffusion
· Image generation with Stable Cascade
· One Step Sketch to Image translation with pix2pix-turbo and OpenVINO
· Animating Open-domain Images with DynamiCrafter and OpenVINO
· Text-to-Video retrieval with S3D MIL-NCE and OpenVINO
· Convert and Optimize YOLOv10 with OpenVINO
· Visual-language assistant with nanoLLaVA and OpenVINO
· Person Counting System using YOLOV8 and OpenVINO™
· Quantization-Sparsity Aware Training with NNCF, using PyTorch framework
· Create an LLM-powered Chatbot using OpenVINO

Conclusion

With this, we are excited to announce that the latest release, OpenVINO 2024.2, is available now! Our team has been working hard on all the new features and performance improvements. As always, we strive to continue enhancing the user experience and expanding the capabilities of OpenVINO. Our development roadmap is already filled with features for our next release that we can't wait to show you as well. Thank you!

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
