Improving OpenVINO™ performance on Generative AI workloads on ARM devices
Hi, I’m Morteza, an MSc student based in Canada, and I had the incredible opportunity to participate in Google Summer of Code (GSoC) 2024 with the OpenVINO organization. My project focused on improving the performance of Generative AI (GenAI) workloads on ARM devices using OpenVINO. With the rapid growth of AI applications, optimizing these models for various hardware architectures, including ARM devices, has become increasingly crucial.
Project Overview
The core goal of my project was to implement a set of optimizations within the OpenVINO runtime specifically targeting Generative AI tasks such as text generation. These optimizations aimed to improve key performance metrics, including latency and throughput, and were essential for enhancing the deployment of Generative AI workloads on ARM devices, which are widely used across various environments, including mobile, edge, servers, and PCs.
Key Objectives and Expected Outcomes
Porting Optimization Techniques to ARM Architecture: A central goal of the project was to port existing optimization techniques, already well established and tuned for Intel hardware, to the ARM architecture. This approach allowed users to take advantage of OpenVINO’s wide-ranging ecosystem and its optimized performance on a broader spectrum of hardware, including mobile, edge, servers, and PCs.
Improved GenAI Workload Adoption: By optimizing OpenVINO for ARM devices, the project aimed to make it easier for developers to adopt GenAI models on these platforms, leveraging the efficient runtime provided by OpenVINO.
Enhanced Performance Metrics: The optimizations focused on reducing latency and increasing throughput for GenAI models.
Contributions and Code
Throughout this project, I made several contributions to the OpenVINO codebase, focusing on optimizing Generative AI workloads for ARM devices. These contributions include code changes, performance improvements, and enhancements to the model compilation process. You can find all my merge requests and detailed changes made during this GSoC project at the following link:
View My GSoC 2024 Merge Requests
Performance Counters for Detailed Analysis
Throughout this project, we extensively utilized OpenVINO’s performance counters to measure the execution time of each operation within Generative AI models. The performance counters provide a CSV file that breaks down the runtime of each operation, offering invaluable insights into how each optimization affects the overall model performance.
This detailed data allowed us to pinpoint bottlenecks, track improvements in latency and throughput, and ensure that our optimizations were having the desired effect on ARM devices. By analyzing the performance counters, we could fine-tune our approach and achieve the best possible results for Generative AI workloads.
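As an illustration, here is a minimal C++ sketch of collecting per-operation timings with OpenVINO's profiling API. The model path and the bare infer() call are placeholders, and this is not the exact measurement harness used in the project.

```cpp
#include <openvino/openvino.hpp>
#include <iostream>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path

    // Enable profiling so each executed node reports its runtime.
    auto compiled = core.compile_model(model, "CPU", ov::enable_profiling(true));
    auto request = compiled.create_infer_request();
    // In a real benchmark the input tensors would be filled here.
    request.infer();

    // Each ov::ProfilingInfo entry describes one executed operation;
    // dumping them as CSV gives the per-operation breakdown described above.
    for (const auto& info : request.get_profiling_info()) {
        std::cout << info.node_name << "," << info.node_type << ","
                  << info.exec_type << ","
                  << info.real_time.count() << " us\n";
    }
    return 0;
}
```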
For guidance on using OpenVINO’s performance counters in the way described above, you can refer to the custom implementation used in this project. This approach involves specific modifications that are not yet part of the main OpenVINO repository.
Alternatively, for general performance monitoring of large language models (LLMs), you can use the llm_bench tool provided by OpenVINO. This tool is specifically designed for benchmarking LLMs and offers a comprehensive set of performance counters and metrics without requiring custom code modifications.
Key Improvements
Now, I will briefly discuss the key optimizations we implemented during the project to enhance the performance of Generative AI models. These improvements targeted reducing latency and improving throughput on ARM devices, and were tested on my Mac M1. Let’s dive into each optimization and its impact on performance.
RoPE Fusion Operation Optimization
One of the optimizations implemented during this project was enabling the RoPE fusion operation on ARM devices. RoPE, or Rotary Position Embedding, is a technique used in transformer models to incorporate positional information into the model’s input data, helping the model understand the order of tokens in a sequence. For more information on the RoPE operation, you can read this nice post. The fusion combines multiple steps of this embedding process into a single, more efficient operation, thereby reducing computational overhead.
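To make the operation concrete, here is a minimal reference sketch of what RoPE computes for a single head vector. The interleaved pairing scheme and the function name are illustrative, not the exact form of the fused OpenVINO kernel, which performs this work in a single operation instead of a chain of Reshape/Transpose/Eltwise/Concat nodes.

```cpp
#include <cmath>
#include <vector>

// Apply rotary position embedding to one head vector at a given token position.
void apply_rope(std::vector<float>& x, size_t position, float base = 10000.0f) {
    const size_t dim = x.size();  // head dimension, assumed even
    for (size_t i = 0; i < dim / 2; ++i) {
        // Rotation angle for this dimension pair at the given position.
        const float theta = position * std::pow(base, -2.0f * i / dim);
        const float c = std::cos(theta);
        const float s = std::sin(theta);
        const float x0 = x[2 * i];
        const float x1 = x[2 * i + 1];
        // Rotate the (x0, x1) pair by theta.
        x[2 * i]     = x0 * c - x1 * s;
        x[2 * i + 1] = x0 * s + x1 * c;
    }
}
```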
The graph below compares the latency of various operations in a model before and after the implementation of RoPE (Rotary Position Embedding) optimization. The left stacked bar represents the cumulative latency of operations such as Concatenation, Eltwise, MatMul, Reshape, Transpose, StridedSlice, Broadcast, and Math, indicating a significant computational overhead prior to optimization. The right stacked bar shows the reduced latency after RoPE optimization has been applied. Notably, the introduction of the RoPE operation streamlines the overall process, leading to a significant decrease in latency, particularly in the “Math” and “Transpose” operations. This reduction demonstrates the effectiveness of RoPE optimization in enhancing model performance and efficiency.
After enabling RoPE fusion, the second token latency on ARM devices decreased from 169 ms to 122 ms, demonstrating significant improvements in latency and throughput for Generative AI models. These results were obtained using the phi3-mini-4k-instruct model with greedy sampling, a prompt length of 760 tokens, and a batch size of 1. For more details on how these benchmarks were conducted, you can refer to the llm_bench tool provided by OpenVINO.
Scaled Dot Product Attention (SDPA) Optimization
Another significant optimization implemented during this project was enabling Scaled Dot Product Attention (SDPA) with key-value (KV) cache fusion in OpenVINO. SDPA is a crucial component in transformer models, responsible for efficiently computing attention scores. For more detailed information on the SDPA operation, you can refer to this blog post. In OpenVINO, a dedicated optimization pass fuses the SDPA operation with KV cache usage, enabling efficient non-linear memory access to data stored in the KV cache. This avoids unnecessary copying of the KV-cache tensor on each inference, which is particularly important for performance when generating long sequences, and results in reduced computational overhead and improved runtime performance.
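The idea behind the fusion can be sketched as follows: keys and values for past tokens stay in a preallocated buffer, and each generation step only writes the new token’s K/V in place, rather than concatenating (and copying) the whole cache. The class and field names below are hypothetical, chosen only to illustrate the memory-access pattern.

```cpp
#include <cstring>
#include <vector>

// Illustrative KV cache with in-place appends; not the plugin's data structure.
struct KVCache {
    size_t head_dim;
    size_t max_tokens;
    size_t length = 0;        // tokens currently stored
    std::vector<float> keys;  // [max_tokens, head_dim], preallocated once
    std::vector<float> values;

    KVCache(size_t head_dim, size_t max_tokens)
        : head_dim(head_dim), max_tokens(max_tokens),
          keys(head_dim * max_tokens), values(head_dim * max_tokens) {}

    // O(head_dim) per generated token: the existing cache is never copied.
    void append(const float* k, const float* v) {
        std::memcpy(&keys[length * head_dim], k, head_dim * sizeof(float));
        std::memcpy(&values[length * head_dim], v, head_dim * sizeof(float));
        ++length;
    }
};
```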
We enabled this optimization on ARM architecture, including my Mac M1, to enhance the performance of Generative AI models. Based on testing on my Mac M1, this optimization reduced the time per token from 127.30 ms to 111.88 ms. This improvement demonstrates the effectiveness of fusing SDPA with KV cache in reducing latency and boosting throughput for AI workloads.
Multi-Head Attention Optimization with ARM Compute Library
To further optimize the performance of Generative AI models on ARM architecture, we utilized the ARM Compute Library (ACL) to implement Multi-Head Attention (MHA). MHA is a critical component of transformer models, where multiple attention heads are computed in parallel to capture different contextual information from the input. You can read this blog post for extensive information on how MHA is used in transformers.
For this optimization, we specifically leveraged the General Matrix Multiply (GEMM) kernels provided by ACL to accelerate the computation of attention mechanisms. This approach was particularly effective for the context (prompt) processing phase and for computations over the key-value (KV) cache, which are central operations within the MHA process. By using ACL’s highly optimized GEMM kernels, we achieved a 2x improvement in performance compared to the reference implementation in OpenVINO.
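As a rough illustration of how a matrix product such as Q*K^T can be delegated to ACL’s NEON kernels, here is a minimal sketch using NEGEMM. The shapes, data layout, and the exact ACL functions used in the OpenVINO plugin may differ; note that ACL’s TensorShape lists the innermost (column) dimension first.

```cpp
#include "arm_compute/runtime/NEON/NEFunctions.h"
#include "arm_compute/runtime/Tensor.h"

using namespace arm_compute;

// Compute out = a * b for an (M x K) * (K x N) product with ACL's NEON GEMM.
void run_gemm(size_t M, size_t K, size_t N) {
    Tensor a, b, out;
    a.allocator()->init(TensorInfo(TensorShape(K, M), 1, DataType::F32));   // M x K
    b.allocator()->init(TensorInfo(TensorShape(N, K), 1, DataType::F32));   // K x N
    out.allocator()->init(TensorInfo(TensorShape(N, M), 1, DataType::F32)); // M x N

    NEGEMM gemm;
    gemm.configure(&a, &b, nullptr, &out, 1.0f, 0.0f);  // out = 1.0 * a * b

    a.allocator()->allocate();
    b.allocator()->allocate();
    out.allocator()->allocate();
    // ... fill a and b with query/key data here ...
    gemm.run();
}
```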
These optimizations were tested on my Mac M1, showing significant reductions in computational time and improving the overall efficiency of running Generative AI workloads.
MHA Single Query Optimization with NEON Vector Extension
In another task, we focused on optimizing the Multi-Head Attention (MHA) single query operation using NEON vector intrinsics on ARM architecture. By leveraging SIMD (Single Instruction, Multiple Data) capabilities through NEON vector extensions, we efficiently accelerated the computation of the next token during the generation phase. This optimization involved refining the dot product and summation operations integral to token generation. As a result, we achieved approximately a 10% improvement in the generation phase, enhancing both latency and throughput for models using this approach.
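The core SIMD pattern here is a vectorized dot product: process four floats per iteration with NEON fused multiply-add and reduce across lanes at the end. The sketch below is illustrative, not the plugin’s exact kernel.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Dot product of two float arrays using NEON (AArch64).
float dot_product_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        // acc += a[i..i+3] * b[i..i+3]
        acc = vfmaq_f32(acc, vld1q_f32(a + i), vld1q_f32(b + i));
    }
    float sum = vaddvq_f32(acc);  // horizontal add of the four lanes
    for (; i < n; ++i) {          // scalar tail for lengths not divisible by 4
        sum += a[i] * b[i];
    }
    return sum;
}
```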
Attention Softmax Implementation with NEON
We also enhanced the Attention Softmax operation using NEON vector intrinsics on ARM architecture. By applying SIMD (Single Instruction, Multiple Data) techniques, we implemented a specialized numerical algorithm tailored for NEON. This approach yielded modest improvements in performance, optimizing the softmax computation and contributing to overall efficiency gains in attention mechanisms for transformer models.
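Structurally, the NEON softmax follows the usual numerically stable recipe, with the max and normalization passes vectorized. The sketch below keeps the exponential as scalar std::exp rather than the specialized approximation used in the project.

```cpp
#include <arm_neon.h>
#include <algorithm>
#include <cmath>
#include <cstddef>

// Numerically stable softmax over n attention scores, in place.
void softmax_neon(float* x, size_t n) {
    // 1) Vectorized max for numerical stability.
    float32x4_t vmax = vdupq_n_f32(-INFINITY);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) vmax = vmaxq_f32(vmax, vld1q_f32(x + i));
    float max_val = vmaxvq_f32(vmax);
    for (; i < n; ++i) max_val = std::max(max_val, x[i]);

    // 2) Exponentiate (scalar here) and accumulate the sum.
    float sum = 0.0f;
    for (i = 0; i < n; ++i) {
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }

    // 3) Vectorized normalization.
    const float32x4_t vinv = vdupq_n_f32(1.0f / sum);
    for (i = 0; i + 4 <= n; i += 4)
        vst1q_f32(x + i, vmulq_f32(vld1q_f32(x + i), vinv));
    for (; i < n; ++i) x[i] /= sum;
}
```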
SDPA Implementation with FP16 Precision
ARM architectures natively support FP16 computations, and OpenVINO can leverage this hardware capability to accelerate inference. However, in the context of large language models (LLMs), the Scaled Dot Product Attention (SDPA) operation lacked FP16 support. In another effort, we addressed this by implementing the SDPA technique for both the generation and prompt phases using FP16 precision. By leveraging NEON vector intrinsics and utilizing ARM Compute Library (ACL) kernels, we optimized SDPA to efficiently support FP16 data. This approach significantly improved performance compared to FP32 precision, achieving speed enhancements of approximately 10% to 20%.
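To show what the FP16 data path looks like at the intrinsic level, here is an illustrative dot product over half-precision inputs that widens to FP32 for accumulation. Whether the actual kernels accumulate in FP16 or FP32 is an implementation detail not reproduced here; the sketch assumes an ARMv8.2-A target with FP16 support.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Dot product over FP16 data, accumulating in FP32 (assumption of this sketch).
float dot_product_f16(const float16_t* a, const float16_t* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        float16x8_t va = vld1q_f16(a + i);
        float16x8_t vb = vld1q_f16(b + i);
        // Widen each half of the FP16 vectors to FP32 and fuse multiply-add.
        acc = vfmaq_f32(acc, vcvt_f32_f16(vget_low_f16(va)),  vcvt_f32_f16(vget_low_f16(vb)));
        acc = vfmaq_f32(acc, vcvt_f32_f16(vget_high_f16(va)), vcvt_f32_f16(vget_high_f16(vb)));
    }
    float sum = vaddvq_f32(acc);
    for (; i < n; ++i) sum += static_cast<float>(a[i]) * static_cast<float>(b[i]);
    return sum;
}
```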
Example prompt
The graphs illustrate the performance gains achieved through our optimizations on a sample prompt “What is OpenVino” on the Phi-3-mini-4k-instruct model. On the left, the bar chart shows a reduction in latency for both the first and second tokens. Initially, the first token latency was 179.89 ms, and the second token latency was 135 ms. After applying optimizations, these latencies were reduced to 160.33 ms and 105.79 ms, respectively, demonstrating significant improvements in processing speed. The right bar chart highlights an increase in throughput from 7.60 tokens per second to 9.63 tokens per second post-optimization. This increase in throughput indicates a higher efficiency in generating tokens, further confirming the optimization’s positive impact on model performance.
Additionally, SDPA (Scaled Dot-Product Attention) now runs in FP16 precision, which it previously did not support. While the CPU plugin already supported FP16 on ARM platforms (with FP16 being the default precision), our optimizations specifically enabled FP16 support for SDPA, leading to further performance improvements.
By the end of the project, we achieved improvements in latency and throughput for targeted GenAI models on ARM devices. This made deploying these models on edge environments more practical.
The video below illustrates the impact of various optimization techniques on the OpenVINO phi-3-mini-instruct chatbot’s performance on ARM devices. The left side shows the chatbot before optimization, while the right side shows it after the enhancements. Following the optimizations, including enabling more operations in FP16 precision, the chatbot’s response time was significantly reduced, improving the user experience. The difference in output sequences is a result of the FP16 precision introduced by these optimizations, which causes slight variations in the response content while still delivering improved performance.
Project Highlights and Challenges
During the project, several challenges emerged, including an initial unfamiliarity with the OpenVINO (OV) code base and its extensive monitoring tools. Navigating the vast array of OV’s monitoring tools and the limited documentation for the ARM Compute Library (ACL) required significant effort and trial and error. Extensive code reading was necessary to understand and implement the optimizations effectively. Fortunately, mentors provided invaluable assistance, and OV’s monitoring tools, such as graph dumping, unit tests, and subgraph tests, greatly facilitated the debugging process. Additionally, ChatGPT played a crucial role in helping find the right NEON vector instructions, which streamlined the optimization efforts and contributed to the overall success of the project.
I’d like to extend my heartfelt thanks to my mentors, Alexandr Voron and Dmitry Gorokhov, for their invaluable support throughout this project. Their guidance and expertise were crucial in overcoming challenges and achieving our optimization goals. Their assistance made navigating the complexities of the OpenVINO code base and tools much more manageable, and their encouragement and advice were greatly appreciated. ❤
Working on this project has been a fantastic learning experience. It provided me with a unique opportunity to apply my skills in C++ and work on cutting-edge AI optimization techniques. I also gained a deeper appreciation for the complexities involved in optimizing software for diverse hardware architectures.
Conclusion
I am grateful for the opportunity to contribute to the OpenVINO community and help push the boundaries of what is possible with Generative AI on ARM devices. This project has been a significant milestone in my journey, and I look forward to continuing to contribute to this exciting field.
Thank you for reading about my GSoC 2024 experience. If you’re interested in learning more about the project or have any questions, feel free to reach out!