Introducing OpenVINO 2024.0: Empowering Developers with Enhanced Performance and Expanded Support

OpenVINO™ toolkit · Published in OpenVINO-toolkit · Mar 6, 2024

Welcome to OpenVINO 2024.0, where we are excited to unveil a host of enhancements aimed at empowering developers in the rapidly evolving landscape of AI! This release improves Large Language Model (LLM) performance with dynamic quantization, better GPU optimizations, and support for Mixture of Experts architectures. OpenVINO 2024.0 helps developers leverage AI acceleration effectively, and we extend our gratitude to the community for their ongoing contributions.

Improvements in Large Language Model inference

LLMs show no signs of vanishing, and models and use cases keep evolving. We continue our mission to accelerate these models and make their inference affordable.

Performance and accuracy enhancements

In this release, we have worked on improving out-of-the-box performance for LLMs and made a few important changes in the runtime and tools.

First, we have introduced dynamic quantization and KV-cache compression for CPU platforms. KV-cache compression lets us run long-sequence generation more resource-efficiently and with better performance. Dynamic quantization generally improves compute and memory consumption in other parts of the model (embedding projections and the feed-forward network). With this update, we have seen a 4.5x improvement in generation latency on the Intel® Xeon® 8490H and achieved 28.2 tokens/sec for mistral-7b-v0.1 [1].

For GPU platforms, we also improved generation characteristics by introducing optimizations in kernels and across the stack. We have also implemented more efficient cache handling that helps generation with beam search. As an example, we have achieved up to 8.5 tokens/sec for the llama-2-7b-chat INT4 precision model on the Intel® Core™ Ultra 7 165H iGPU [1].

Second, while performance is always a hot topic, accuracy is just as critical. We improved the accuracy of our weight compression algorithms within NNCF: we introduced the ability to compress weights using statistics from datasets and added an implementation of the AWQ algorithm to improve accuracy even further.
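To give a flavor of this in NNCF directly, below is a minimal sketch of data-aware INT4 weight compression with AWQ; the model path, tokenizer id, calibration texts, and parameter values are illustrative placeholders rather than recommendations, and the exact inputs your model expects may differ.

import nncf
import openvino as ov
from transformers import AutoTokenizer

core = ov.Core()
model = core.read_model("llm/openvino_model.xml")           # hypothetical IR of an LLM
tokenizer = AutoTokenizer.from_pretrained("your-model-id")  # hypothetical tokenizer id

# A few raw calibration texts; real workloads would use a proper dataset
samples = [
    "OpenVINO 2024.0 improves LLM inference.",
    "Mixture of Experts models route tokens to different experts.",
]

# NNCF calls the transform function on each sample to build model inputs
# for collecting activation statistics
calibration_dataset = nncf.Dataset(samples, lambda text: dict(tokenizer(text, return_tensors="np")))

compressed = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_SYM,
    ratio=0.8,                    # share of weights compressed to 4 bits
    group_size=128,
    dataset=calibration_dataset,  # enables statistics-based (data-aware) compression
    awq=True,                     # apply the AWQ algorithm for better accuracy
)
ov.save_model(compressed, "llm/openvino_model_int4.xml")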

Moreover, through our integration with Hugging Face Optimum Intel, you can now compress models directly through the Transformers API, as shown below:

Code source here: https://github.com/huggingface/optimum-intel/pull/538
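Here is a minimal sketch of what that call can look like; the model id is only an example, and argument names may differ slightly between optimum-intel releases.

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "stabilityai/stablelm-2-zephyr-1_6b"  # example model id

# Simplest path: rely on the built-in default 4-bit configuration
model = OVModelForCausalLM.from_pretrained(model_id, export=True, load_in_4bit=True)

# Or pass an explicit weight-quantization config for finer control
quant_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, group_size=128)
model = OVModelForCausalLM.from_pretrained(model_id, export=True, quantization_config=quant_config)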

Note the load_in_4bit option set to True and the ability to pass quantization_config right in the call to the from_pretrained method; this does all the compression work for you. What's more, we have already added quantization configs for the most popular models, including Llama2, StableLM, ChatGLM, and QWEN, so for those models you don't need to pass a config at all to get 4-bit compression.

For more information on the quality of our algorithms, you can always refer to OpenVINO documentation or NNCF documentation on GitHub.

Support for Mixture of Experts architectures

Mixture of Experts (MoE) represents the next major architecture evolution, bringing better accuracy and performance to LLMs. It started with Mixtral and quickly evolved into many more models, as well as frameworks that allow creating MoE-based models from existing ones.

Throughout the 2024.0 release, we have worked on enabling those architectures and improving their performance. Not only have we worked on efficient conversion of those models, but we have also changed some internals to better handle the dynamic selection of experts within our runtime.

We are in the process of upstreaming our changes to Hugging Face Optimum Intel so conversion of those models is transparent.

Changes for new platforms and enhancements for existing ones

Wider access to Intel NPU

With the release of Intel® Core™ Ultra, our NPU accelerator is finally available to a wide audience of developers. It is an evolving product from both a software and a hardware perspective, and we are excited about the capabilities we can achieve with it. You might have seen some demos of OpenVINO™ notebooks running on the NPU already.

In this release, we are making NPU support available when you install OpenVINO through our most popular distribution channel, PyPI. A couple of things to note:

- NPU requires drivers to be installed in the system, so if you intend to use it, make sure you follow this short guide

- NPU is currently not included in the Automatic Device Selection logic, so if you are planning to run your models on NPU, make sure you specify the device name (e.g. NPU) explicitly as follows:

compiled_model = core.compile_model(model=model, device_name="NPU")
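For context, here is a slightly fuller sketch; the model path is hypothetical, and checking core.available_devices is a handy way to confirm the NPU driver is visible.

import openvino as ov

core = ov.Core()
print(core.available_devices)         # e.g. ['CPU', 'GPU', 'NPU'] when the driver is installed

model = core.read_model("model.xml")  # hypothetical IR path
compiled_model = core.compile_model(model=model, device_name="NPU")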

Improving support for ARM CPUs

Threading was one of the things we were not implementing efficiently on ARM platforms, and that was holding back our performance. We worked with the oneTBB team (our default threading engine provider) to change ARM support and improved our performance significantly. At the same time, after some work on the accuracy of certain operations, we have enabled fp16 as the default inference precision on ARM CPUs.

Overall, this means higher performance on ARM CPUs, but also availability of the OpenVINO streams feature, which allows higher throughput on multi-core platforms.
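As a rough illustration, here is how throughput-oriented execution, which relies on streams under the hood, can be requested; the model path is hypothetical and the property values are only examples.

import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")  # hypothetical IR path

# Let the runtime pick the number of streams for a throughput-oriented setup...
compiled_model = core.compile_model(model, "CPU", {"PERFORMANCE_HINT": "THROUGHPUT"})

# ...or request a specific number of streams explicitly
compiled_model = core.compile_model(model, "CPU", {"NUM_STREAMS": "4"})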

Removing some legacy

2024.0 is our next major release, and traditionally that is the time when we remove outdated components from our toolkit.

Two years ago, we changed our API dramatically to keep up with the evolution of the Deep Learning space, but to minimize the impact on existing developers and products that use OpenVINO, we kept supporting API 1.0. A lot has changed since then, and we are now removing the old API completely. On top of that, we are also removing tools that we had marked as deprecated. That includes:

- Post-Training Optimization tool, a.k.a. POT.

- Accuracy Checker framework

- Deployment manager

Those tools were part of the openvino-dev package, which has not been mandatory for a while. We will keep the package for users who continue to rely on our offline model conversion tool, Model Optimizer.

If you have not been able to migrate to the new API, chances are high that you can continue using one of our LTS releases, for instance, 2023.3.
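For those who do plan to migrate, the change mostly amounts to moving from the Inference Engine classes to the ov.Core API. A rough before/after sketch (file paths are hypothetical):

# API 1.0 (removed in 2024.0):
# from openvino.inference_engine import IECore
# ie = IECore()
# net = ie.read_network(model="model.xml", weights="model.bin")
# exec_net = ie.load_network(network=net, device_name="CPU")

# API 2.0:
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")
compiled_model = core.compile_model(model, "CPU")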

New and modified notebooks

We continue to showcase the most important updates in the AI field and how to leverage OpenVINO to accelerate those scenarios. Here is what we have been working on:

· Mobile language assistant with MobileVLM

· Depth estimation with DepthAnything

· Multimodal Large Language Model (MLLM) Kosmos-2

· Zero-shot Image Classification with SigLIP

· Personalized image generation with PhotoMaker

· Voice tone cloning with OpenVoice

· Line-level text detection with Surya

· Zero-shot Identity-Preserving Generation with InstantID

· LLM chatbot and LLM RAG pipeline notebooks were updated with new model integrations: minicpm-2b-dpo, gemma-7b-it, qwen1.5-7b-chat, baichuan2-7b-chat

Thank you, our developers and contributors!

Over the history of OpenVINO, we have seen many exciting projects, from classic object detection all the way to an interview preparation tool. We decided to put together a list of amazing projects that use OpenVINO, and it continues to grow rapidly! Create a pull request with your project, use the “mentioned in Awesome” badge for your project, and share your goodness with us!

Our developer base is growing, and we appreciate all the changes and improvements that our community is making. It is amazing to see that some of you have specified that you are “busy with helping to improve OpenVINO”. Thank you!

One example of the work done by our contributors is OpenVINO support on the openSUSE platform.

In the past weeks, we have faced a major issue, though: we are not able to create Good First Issues and review pull requests fast enough. We recognize this problem and will work harder to fix it, so stay tuned for more.

Additionally, we are gearing up for Google Summer of Code and receiving very interesting project proposals from you! There is still time to submit your idea before we send it out for approval.

In this release, a list of our beloved contributors is published on GitHub.

[1] Results may vary. For workloads visit: workloads and for configurations visit: configurations. See also Legal Information.

Notices & Disclaimers

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component can be absolutely secure.

Your costs and results may vary.

Intel technologies may require enabled hardware, software or service activation.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.

Benchmark System Configuration and Workloads.

System Configuration

Workload Description.

The workload parameters affect the performance results of the models. Models are executed using a batch size of 1. Below are the parameters for the GenAI models:

· Input tokens: 1024

· Output tokens: 128

· Number of beams: 1

· Tokens for the GenAI models are in English

GitHub repo for the benchmark application used: https://github.com/openvinotoolkit/openvino.genai/tree/master/llm_bench/python

