Introducing OpenVINO™ 2024.4

OpenVINO™ toolkit · Published in OpenVINO-toolkit · 5 min read · Sep 30, 2024

This release introduces important functional and performance changes across the OpenVINO™ product family, making the optimization and deployment of Large Language Models (LLMs) easier and more performant across all supported scenarios, including edge and data center environments.

On the client side, we have been working hard over the past several releases, including this one, to enable our brand-new Intel® Xe2 GPU architecture, featured in the recently launched Intel® Core™ Ultra Processors (Series 2). The Xe2 architecture is powered by Intel® Xe Matrix Extensions (Intel® XMX) acceleration technology, which we have enabled in collaboration with our partners on the oneDNN and driver teams to achieve peak performance on compute-intensive operations such as matrix multiplication. Since matrix multiplication is a key hotspot in LLMs, the performance benefits of the Xe2 architecture are immediately noticeable when deploying LLMs.

We not only optimized matrix multiplication directly via Intel® XMX, but also created highly optimized GPU primitives such as Scaled Dot Product Attention and Rotary Positional Embeddings to reduce execution pipeline overhead for these complex operations. We also reduced memory consumption and improved support for models with compressed weights, so deployments stay laptop- and edge-friendly and LLMs fit into the smallest possible memory footprint, which is critical for resource-limited environments.

Some of the changes we have made are generic and significantly benefit other platforms as well, including integrated GPUs on other platforms (e.g., Intel® Core™ Ultra (Series 1)) and discrete GPUs (the Intel® Arc™ family).

Our performance and accuracy validation spans dozens of Large Language Models, so we measure these improvements across the entire set of models. Accuracy impact is tightly controlled by the weight-compression algorithms in the Neural Network Compression Framework (NNCF).
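For example, 4-bit weight compression of an exported OpenVINO model can be applied with a few lines of NNCF. The snippet below is a minimal sketch; the model path, compression mode, ratio, and group size are illustrative choices, not release defaults.

import nncf
import openvino as ov

core = ov.Core()
model = core.read_model("llm/openvino_model.xml")  # illustrative path to an exported LLM

# Compress most linear-layer weights to 4-bit, keeping a share of layers in 8-bit
# (controlled by ratio) to limit accuracy impact; group_size tunes quantization granularity.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.8,
    group_size=128,
)
ov.save_model(compressed_model, "llm/openvino_model_int4.xml")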

Comparing the performance of built-in GPUs, the Intel® Core™ Ultra Processor (Series 2) offers up to a 1.3x improvement in second-token latency over Series 1 for LLMs such as Llama-3-8B and Phi-3-mini-4k-instruct; see the chart below.

Maximize LLM performance on the latest Intel® Core™ Ultra Processor (Series 2) built-in GPUs with the OpenVINO toolkit 2024.4. See Appendix for workloads and configurations. Results may vary.

Besides GPUs, Intel® Core™ Ultra Processors (Series 2) introduce a more powerful NPU with 40 TOPS of peak inference throughput, a substantial upgrade from the previous generation. OpenVINO™ now provides access to this acceleration technology for both classical deep learning models (e.g., computer vision, speech recognition and generation) and LLMs via the OpenVINO™ GenAI package. We have been working with the NPU team on improving performance, reducing memory consumption, and speeding up model compilation over the past releases, and will continue these enhancements in coming releases.
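As a rough illustration, running an exported and compressed LLM on the NPU (or any other supported device) through the OpenVINO™ GenAI package can look like the sketch below; the model directory and generation length are placeholders.

import openvino_genai as ov_genai

# Directory containing an exported OpenVINO LLM (IR plus tokenizer); the path
# and the device string ("NPU", "GPU", or "CPU") are illustrative.
pipe = ov_genai.LLMPipeline("./llama-3-8b-instruct-int4-ov", "NPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))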

Another popular way to use LLMs is through model serving, where models are accessed via REST APIs and served by frameworks such as vLLM or OpenVINO™ Model Server (OVMS). For this scenario, we are also introducing new features that improve serving throughput and memory efficiency.

OVMS now serves LLMs through OpenAI APIs and adds a prefix caching feature that improves serving throughput by caching computations for common parts of prompts. This is especially useful when prompts start with the same text (e.g., “You are a helpful AI assistant”) or when using LLMs in a chat scenario. We also enabled KV cache compression for CPUs within OVMS, reducing memory consumption and improving metrics such as second-token latency.
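Because OVMS exposes OpenAI-compatible endpoints, existing OpenAI client code can simply be pointed at the server. Below is a minimal sketch, assuming an OVMS instance serving a model named "llama" behind its v3 endpoint on localhost; the base URL, port, and model name depend on your deployment.

from openai import OpenAI

# Values below are illustrative; adjust them to match your OVMS deployment.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="llama",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant."},
        {"role": "user", "content": "What does prefix caching speed up?"},
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)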

Starting with the OpenVINO™ 2024.4 release, GPUs support the PagedAttention operation and continuous batching, which allows us to use GPUs in LLM serving scenarios. We initially enabled this in our contribution to vLLM and have extended it to the OpenVINO™ Model Server in this release. This allows Intel® Arc™ GPUs to serve LLMs in your environment with optimized serving characteristics. Check out the LLM serving demo for CPU and GPU that shows how you can take advantage of these capabilities.
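The same continuous-batching machinery is also exposed directly in the OpenVINO™ GenAI package. The sketch below is illustrative only and assumes your openvino_genai version exposes ContinuousBatchingPipeline and the scheduler options shown; the model directory, cache size, and device are placeholders.

import openvino_genai as ov_genai

# Scheduler settings control the KV-cache budget and prompt-prefix reuse;
# the values here are illustrative.
scheduler_config = ov_genai.SchedulerConfig()
scheduler_config.cache_size = 2                # KV-cache budget in GB
scheduler_config.enable_prefix_caching = True  # reuse computation for shared prompt prefixes

pipe = ov_genai.ContinuousBatchingPipeline("./llama-3-8b-instruct-int4-ov", scheduler_config, "GPU")

generation_config = ov_genai.GenerationConfig()
generation_config.max_new_tokens = 128

results = pipe.generate(["What is continuous batching?"], [generation_config])
print(results[0])  # GenerationResult holding the generated text(s)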

To continue with data center scenarios, OpenVINO™ now supports the mxfp4 format, as defined in the Open Compute Project specification, when running on Intel® Xeon® processors. For LLMs, this improves second-token latency and reduces memory consumption compared with BF16 precision. It is supported by the Neural Network Compression Framework (NNCF), whose model optimization capabilities allow LLM weights to be compressed into this format.
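Compression into this format is driven through NNCF as well. A minimal sketch, assuming your NNCF version exposes the E2M1 mode that corresponds to mxfp4; the model path and group size are illustrative.

import nncf
import openvino as ov

model = ov.Core().read_model("llm/openvino_model.xml")  # illustrative path

# E2M1 weight compression corresponds to the OCP mxfp4 format;
# group_size is an illustrative choice.
compressed_model = nncf.compress_weights(
    model,
    mode=nncf.CompressWeightsMode.E2M1,
    group_size=32,
)
ov.save_model(compressed_model, "llm/openvino_model_mxfp4.xml")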

From a model support perspective, we are continuously working with our partners at Hugging Face to update the Optimum-Intel solution. It allows you to run models through the Hugging Face API on top of the OpenVINO™ runtime, and to efficiently export and compress models for use with the OpenVINO™ GenAI package APIs. In this release, we focused on enabling models such as Florence 2, MiniCPM2, Phi-3-Vision, Flux.1, and more. Notebooks are already available that demonstrate how to use these models with OpenVINO™ on the platform of your choice.
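As a quick illustration of the Optimum-Intel path, the sketch below runs a Hugging Face model on the OpenVINO™ runtime and saves the exported model for later reuse, for example with the OpenVINO™ GenAI package. The model ID and prompt are illustrative, and multimodal models such as Florence 2 or Phi-3-Vision use their own task-specific classes.

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "microsoft/Phi-3-mini-4k-instruct"  # illustrative model choice
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert to OpenVINO on the fly
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# Save the exported model and tokenizer for reuse.
model.save_pretrained("phi-3-mini-ov")
tokenizer.save_pretrained("phi-3-mini-ov")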

Text-to-image generation using Flux.1 and OpenVINO™ with input prompt: a tiny Yorkshire terrier astronaut hatching from an egg on the moon.

Over the summer we have been working with our great contributors from Google Summer of Code, and the results are encouraging: improving Generative AI on ARM platforms, enabling RISC-V, and exploring many other exciting developments that we will highlight in more detail soon.

Thank you, and we look forward to bringing you more performance improvements and new features in upcoming releases. For more details on this release, see the release notes.

Appendix

+------------------------+
| Workloads              |
+------------------------+
| Llama-2-7b-chat        |
| Llama-3-8B             |
| Mistral-7b-V0.1        |
| Phi-3-mini-4k-instruct |
+------------------------+

+----------------+-------------------------+
| Precision      | BIT Default Compression |
| Input tokens   | 1024                    |
| Output tokens  | 128                     |
| Beam search    | 1                       |
| Batch size     | 1                       |
+----------------+-------------------------+

+----------------------------------------+--------------------------------------------------+-------------------------------------------------+
| CPU Inference Engines:                 | Intel® Core™ Ultra Processor (Series 1)          | Intel® Core™ Ultra Processor (Series 2)         |
+----------------------------------------+--------------------------------------------------+-------------------------------------------------+
| Motherboard                            | Intel Corporation CRB (Reef Ridge + Astral Peak) | Intel Corporation Reference Validation Platform |
| CPU                                    | Intel® Core™ Ultra 7 165H @ 1.8 GHz              | Intel® Core™ Ultra 7 268V @ 2.2 GHz             |
| Hyper-Threading                        | on                                               | on                                              |
| Turbo Setting                          | on                                               | on                                              |
| Memory                                 | 2 x 16 GB DDR5 @ 5600 MHz                        | On SoC 32 GB LPDDR5 @ 8533 MHz                  |
| Operating System                       | Windows 11                                       | Windows 11                                      |
| Kernel version                         | 10.0.22631 Build 22631                           | 10.0.22631 Build 22631                          |
| BIOS Vendor                            | Intel Corporation                                | Intel Corporation                               |
| BIOS Version                           | MTLPEMI1.R00.3471.D56.2403181159                 | LNLMFWI1.R00.3221.D83.2408120121                |
| BIOS Release                           | 3/18/2024                                        | 8/12/2024                                       |
| Batch size                             | 1                                                | 1                                               |
| Test Date                              | 9/6/2024                                         | 9/6/2024                                        |
| Power dissipation/socket, TDP in Watts | 28                                               | 17                                              |
+----------------------------------------+--------------------------------------------------+-------------------------------------------------+

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

Performance varies by use, configuration and other factors. Learn more on the Performance Index site.

Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
