Introducing OpenVINO 2024.3: Enhanced LLM Performance

OpenVINO™ toolkit · Aug 8, 2024

We are pleased to announce that OpenVINO™ 2024.3 is now available! This update brings new features and enhancements, especially to Large Language Model (LLM) performance. In this post we go over the key improvements in the release; for the full list, refer to the release notes.

Models on Hugging Face

Hugging Face continues to grow as the go-to platform for discovering and acquiring AI models. You can now find a selection of OpenVINO pre-optimized models on Hugging Face, making it easier to access and run models quickly. The collection includes multiple precisions of models such as Phi-3, Mistral, Mixtral, LCM Dreamshaper, StarCoder2, and more; for all available models, see Hugging Face. Each model card includes a description of the model and instructions for running inference with Optimum Intel or with the OpenVINO GenAI package. This addition aims to make AI models more accessible and to accelerate model integration and deployment.
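To illustrate the Optimum Intel route described in the model cards, here is a minimal sketch. The model ID below is a placeholder; substitute any OpenVINO pre-optimized model from Hugging Face:

# pip install optimum[openvino]
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Placeholder model ID; pick any OpenVINO pre-optimized model from Hugging Face.
model_id = "OpenVINO/Phi-3-mini-4k-instruct-int4-ov"

# Loads the pre-converted OpenVINO IR directly, so no export step is needed.
model = OVModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What is OpenVINO?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))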

Performance Improvements

Improved LLM Performance on Intel Discrete GPUs

Intel’s lineup of discrete GPUs provides accelerated processing power for computationally intensive AI tasks. In this release, we focused on improving the performance of LLMs and other models on discrete GPUs. Compared to 2024.1, the first release of this year, 1st token latency shows performance gains of 1.9x to 6.8x on Intel® Arc™ discrete GPUs, and 2nd token throughput shows gains of 2x to 2.9x. These improvements were achieved through Multi-Head Attention (MHA) optimizations and oneDNN enhancements. Other models with performance improvements on discrete GPUs in this release include Stable Diffusion and Whisper; in particular, Stable Diffusion shows gains of 1.1x to 1.6x in image generation time over the previous release.

Chart: 2nd token latency, measured in tokens per second (higher is better).

ChatGLM2-6B, Llama-2-7b-chat, and Mistral-7b-v0.1: input tokens: 1024 | output tokens: 128 | beam search: 1 | batch size: 1 | precision: INT4

Falcon-7b-instruct: input tokens: 32 | output tokens: 128 | beam search: 1 | batch size: 1 | precision: INT4

For more measurement and system configuration details, please visit: https://edc.intel.com/content/www/us/en/products/performance/benchmarks/mobile_1/. Performance varies by use, configuration, and other factors. Learn more at intel.com/PerformanceIndex. Performance results are based on testing as of the dates shown in the configurations and may not reflect all publicly available updates.
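If you want to try LLM inference on a discrete GPU yourself, a minimal sketch with the OpenVINO GenAI package looks like the following. It assumes you already have a model in OpenVINO IR format (the directory name here is a placeholder, for example one exported with optimum-cli or downloaded from Hugging Face):

# pip install openvino-genai
import openvino_genai

# Placeholder path to a directory containing an OpenVINO IR model and tokenizer.
model_dir = "TinyLlama-1.1B-Chat-v1.0-int4-ov"

# "GPU" targets the default Intel discrete or integrated GPU; use "CPU" otherwise.
pipe = openvino_genai.LLMPipeline(model_dir, "GPU")
print(pipe.generate("What is OpenVINO?", max_new_tokens=100))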

Improved CPU Performance When Serving LLMs

vLLM, an open-source library for LLM inference and model serving, has gained traction in the AI community since its introduction thanks to its innovative techniques for improving LLM inference performance and memory efficiency. In this release, OpenVINO is integrated with vLLM and supports continuous batching, leading to improved CPU performance when serving LLMs. OpenVINO leverages vLLM techniques such as fully connected layer optimization, fusing multiple fully connected layers (MLP), a U8 KV cache, and dynamic split fuse, which together increase inference speed and reduce memory usage. For example, in scenarios focused on maximizing throughput, the compute requirements of fully connected layers can match or exceed memory bounds when batch sizes are large. In these situations, fusing multiple fully connected layers (MLP) makes more efficient use of memory bandwidth and increases the number of calculations performed per memory access. You can take advantage of these new features using OpenVINO Model Server (OVMS) or the OpenVINO backend in vLLM: check out the OVMS sample, and see the installation guide for using OpenVINO as a backend for vLLM.
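As a quick illustration of the vLLM path, here is a minimal offline-inference sketch using vLLM's standard Python API. The model ID is a placeholder, and selecting the OpenVINO backend depends on how vLLM was installed, as described in the installation guide:

# Requires a vLLM build with OpenVINO support; see the installation guide.
from vllm import LLM, SamplingParams

# Placeholder model ID; any Hugging Face causal LM supported by vLLM works.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling = SamplingParams(temperature=0.8, max_tokens=128)
outputs = llm.generate(["What is continuous batching?"], sampling)
for out in outputs:
    print(out.outputs[0].text)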

Conclusion

As always, we value your feedback and contributions to help continuously improve OpenVINO. With each release we look forward to seeing the new and creative ways you use OpenVINO to advance your AI initiatives. Thank you!

Additional Resources

OpenVINO Documentation
Jupyter Notebooks
Installation and Setup

Product Page

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
