My Journey with Google Summer of Code 2024: Enhancing OpenVINO for RISC-V Devices

BHbean
Published in OpenVINO-toolkit
Sep 4, 2024

Introduction

This summer, I had the incredible opportunity to participate in Google Summer of Code (GSoC) 2024 with the OpenVINO community. The aim of my project was to advance OpenVINO’s support for RISC-V devices, focusing on optimizing runtime performance with the RISC-V Vector Extension (RVV). This blog post reflects on my experiences, challenges, and the outcomes achieved during this exciting journey.

At the beginning

I was introduced to the world of ML through my initial research on AIOps while pursuing my bachelor’s degree. At that time, I was more interested in the underlying systems and hardware that support these complex DL models (like VAE), so I chose to dive deeper into the field of compilers and RISC-V ISA as a master’s candidate.

Out of personal interest, I tried fine-tuning a vertical Chinese LLM focused on Chinese archaeology based on Baichuan2. However, fine-tuning or deploying an LLM on resource-limited devices is quite challenging. This unsuccessful attempt made me aware of the relationship between high-level AI applications and the underlying computational resources, and my research interests shifted toward ML systems.

Nevertheless, I was a complete beginner in this area, so I was eager to learn more about existing open-source ML systems. It just so happened that the organizations for GSoC 2024 had been announced, and I found the OpenVINO™ community to be a perfect choice to start with!

Preparing the application

Once I had chosen the target community, I began preparing my application for GSoC 2024. There are actually several steps before submitting the final proposal to the GSoC website:

Finish the prerequisite task

To contribute to OpenVINO™ during GSoC, I needed to complete a pre-task required by the organization. The community provided a list of GFIs (Good First Issues), and my task was to pick one and fix the selected issue. These issues were not too difficult to fix, as the main goal of this task was to help contributors become more familiar with the code structure of the main repo and the build process. I chose an issue related to the Python API, as I was familiar with this language, and it took me 3~4 days to locate the problem and fix the bug. The related PR has been merged.

Note: The requirements of different communities might vary greatly, so be sure to check the official website of your target community for important information.

Reach out to the mentors

OpenVINO™ provided a list of ideas that could be implemented during GSoC. I was particularly interested in accelerating the inference for existing models, so I mainly focused on these types of ideas. However, after going through all the possible projects on the list, I wasn’t sure what to choose, as none of the projects seemed to match my previous experience. Then, I noticed a discussion about extending OpenVINO with RISC-V support. I was so excited because I had some experience working with RISC-V before and was quite familiar with this ISA. Therefore, I proposed a new idea to help with the porting and optimization of OpenVINO on RISC-V devices, and luckily, one of the mentors showed a positive attitude toward my idea! Thanks, Dima @dmitry.gorokhov!

Note: Besides the potential projects suggested by the community, you are also welcome to propose your own ideas related to the community. So, don’t be afraid to share your thoughts bravely!

Prepare project-related demo

After receiving a positive reply from the mentor, I began considering what I could do to demonstrate my ability in this project. I noticed that no one had attempted to compile OpenVINO for RVV 1.0 (version 1.0 of the RISC-V Vector extension) targets, so I decided to create a demo for it. I cross-compiled OpenVINO for RVV 1.0 devices using the Clang compiler, and the build artifacts were validated on the QEMU simulator. I documented the challenges and bugs I encountered as a guide for others.

Note: I think it is beneficial to complete a small part of the target project or create a small demo before the application. This demonstrates that you are familiar with the repository and capable of solving problems during the coding phase.

Write the proposal

For the Google Summer of Code application, the most important thing to prepare is a proposal that explains why you are a good fit for the project and outlines your future plans for it. OpenVINO™ provided an official proposal template, so I followed it to ensure everything was stated clearly. I used Sawradip’s proposal as a reference to write my own; it was both beautifully written and inspiring. After finishing the draft version, I sent it to Dima via email for review and feedback. His suggestions improved the proposal, and our joint effort helped me become one of the contributors to GSoC 2024. Thank you again, Dima @dmitry.gorokhov!

My proposal is available here for any newcomers to check :)

Note: It is important to stay in contact with your potential mentors before submitting your proposal. Their advice can significantly increase the chances of your proposal being accepted.

Project Overview

OpenVINO is Intel’s toolkit for optimizing and deploying AI inference, and its expansion to RISC-V devices presents a significant challenge due to the unique architecture of these processors. My project focused on porting and optimizing OpenVINO to efficiently run on RISC-V devices equipped with the RISC-V Vector extension (RVV). The primary goal was to improve the inference performance of deep learning models on these devices.

Background

RISC-V, an open and free ISA, has rapidly developed over the past 14 years, leading to numerous practical products such as edge devices, laptops, and servers. The RISC-V Vector Extension (RVV) v1.0, ratified in 2021, marks a significant step towards High-Performance Computing (HPC), with several processors and IP cores already supporting it. The growing support from open-source projects like ncnn, OpenCV, and libjpeg-turbo indicates a positive shift toward this architecture. As RISC-V DSPs are likely to dominate the edge device market, now is an excellent time for OpenVINO to prepare for this wave.

Fig. 1 Performance comparison of three models on Lichee Pi 4A and Raspberry Pi 4 before the GSoC project.

Fig. 1 compares performance on the Lichee Pi 4A (RISC-V) and the Raspberry Pi 4 (ARM) before the GSoC project. The two boards have similar single- and multi-core performance (see Fig. 2), yet the unoptimized OpenVINO library on the Lichee Pi 4A shows significantly longer execution times than the optimized ARM version on the Raspberry Pi 4. For example, the bert-large-cased model takes 152.0 seconds on the Lichee Pi 4A but only 4.2 seconds on the Raspberry Pi 4, a 36.2x gap. Similarly, the bert-base-ner and t2t-vit-14 models run 37.9x and 47.3x faster on the Raspberry Pi 4, respectively. This highlights the critical impact of hardware-specific optimizations on performance across architectures.
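
The ratios quoted here are simply the Lichee Pi 4A time divided by the Raspberry Pi 4 time. As a minimal sketch, using only the bert-large-cased timings given in the text (the function name is illustrative, not part of any tool):

```python
def slowdown(riscv_seconds: float, arm_seconds: float) -> float:
    """Return how many times longer the RISC-V run takes than the ARM run."""
    if arm_seconds <= 0:
        raise ValueError("arm_seconds must be positive")
    return riscv_seconds / arm_seconds


# bert-large-cased timings quoted in the text (seconds).
ratio = slowdown(152.0, 4.2)
print(f"bert-large-cased: {ratio:.1f}x slower on Lichee Pi 4A")
```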

Fig. 2 Benchmark results of VisionFive2, Lichee Pi 4A and Raspberry Pi 4. (This diagram is taken from the official website of Lichee Pi 4A. Please refer to the link for detailed information.)

Methodology

The first step was to define the scope of benchmarking models across different domains. We carefully selected 10 models from various fields (shown in Table 1), which allows us to verify OpenVINO’s functional correctness on a wide range of DL tasks.

Table 1 Ten selected models from different domains.

The integration design for the RISC-V architecture in OpenVINO is shown in Fig. 3. This design aims to extend the existing modular structure of the CPU plugin by introducing RISC-V-specific opsets and transformation pipelines. This approach maximizes code reuse by leveraging existing Intel CPU operations and optimizations while providing the flexibility to add RISC-V-specific executors and kernels. The SHL library is an optimized neural network operator library for the RISC-V architecture developed by T-Head, where the kernels can achieve high performance on RISC-V devices. Therefore, we currently choose this library as the underlying kernel for the models and need to implement an executor to bridge these kernels with OpenVINO.

Fig. 3 The overview of the integration design for the RISC-V architecture in OpenVINO.

The whole process of how to optimize operators based on priority is shown in Fig. 4. It begins by using a benchmarking tool (benchmark_app) to evaluate the performance of selected models (such as bert-large-cased, bert-base-ner, and t2t-vit-14) on RISC-V devices, like Lichee Pi 4A. By analyzing the collected data, the most time-consuming or resource-intensive operations are identified as bottlenecks. The next step involves integrating SHL kernels to optimize these specific bottleneck operators for the target hardware, improving overall performance through iterative refinement.
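
The selection step can be sketched as follows. This is a simplified illustration rather than the project's actual tooling: it assumes per-operator timings have already been exported (for example, from the benchmarking run) as (operator type, microseconds) pairs, and the sample numbers are made up purely for demonstration.

```python
from collections import defaultdict


def rank_hot_operators(records, top_n=3):
    """Aggregate time per operator type and return the top_n types
    as (op_type, total_us, share_of_total), sorted by total time."""
    totals = defaultdict(float)
    for op_type, micros in records:
        totals[op_type] += micros
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(op, t, t / grand_total) for op, t in ranked[:top_n]]


# Hypothetical per-node timings in microseconds (NOT measured data).
sample = [
    ("FullyConnected", 600.0),
    ("Add", 150.0),
    ("FullyConnected", 200.0),
    ("Softmax", 50.0),
]
hot = rank_hot_operators(sample, top_n=2)
for op, total_us, share in hot:
    print(f"{op}: {total_us:.0f} us ({share:.0%})")
```

Operators at the top of such a ranking are the natural first candidates for an optimized kernel.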

Fig. 4 Steps on how to select and optimize “hot” operators.

Key Achievements

Refining tests on RISC-V

Some of the existing tests failed or were even broken on RISC-V devices, mainly because of missing functionality on the target platform (unsupported precisions or layouts) rather than accuracy problems. It is therefore essential to maintain a collection of tests for RISC-V that correctly validates the functionality of OpenVINO. To address this, I modified the configuration of the OpenVINO test framework and fixed some buggy tests that could not be skipped correctly. Now, all tests on RISC-V either pass or are skipped (see Fig. 5), which paves the way for future development on this architecture.
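
One common way to skip tests for unsupported functionality is to match test names against a list of patterns. The sketch below illustrates that idea in Python under that assumption; the patterns and test names are hypothetical, and the real configuration lives in OpenVINO's C++ test framework.

```python
import re

# Hypothetical skip patterns: a precision and a layout assumed to be
# unsupported on the target platform.
SKIP_PATTERNS = [
    r".*netPRC=bf16.*",
    r".*layout=nChw16c.*",
]
_COMPILED = [re.compile(p) for p in SKIP_PATTERNS]


def should_skip(test_name: str) -> bool:
    """Return True if the test name fully matches any skip pattern."""
    return any(p.fullmatch(test_name) for p in _COMPILED)


# Hypothetical test names, for illustration only.
tests = [
    "smoke_Eltwise/Add_netPRC=f32_layout=nchw",
    "smoke_Eltwise/Add_netPRC=bf16_layout=nchw",
]
for name in tests:
    print(name, "-> SKIPPED" if should_skip(name) else "-> RUN")
```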

Fig. 5 Test results on Lichee Pi 4A.

Integration of element-wise operators

For many models, element-wise operators account for a large proportion of execution time during inference. To address this, I implemented an element-wise executor that invokes kernels from the SHL library. The following element-wise operators are now supported: Add, Subtract, Multiply, Divide, Maximum, Minimum, Exp, Clamp, Relu, and Prelu. Fig. 6 shows the execution time of different models on Lichee Pi 4A after the integration of these element-wise operators. Compared with the existing version, which already uses OpenMP for threading and the SHL FullyConnected kernel, my implementation achieves an average performance boost of 1.11x and up to 1.61x.
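
The idea of an executor that bridges framework operators to library kernels can be illustrated with a small dispatch table. This is a conceptual Python sketch only: the real executor is implemented inside the CPU plugin and calls SHL kernels, whereas the kernels below are plain Python stand-ins and all class and variable names are hypothetical.

```python
import math

# Stand-in kernels; in the real plugin these would be SHL library calls.
KERNELS = {
    "Add": lambda a, b: [x + y for x, y in zip(a, b)],
    "Multiply": lambda a, b: [x * y for x, y in zip(a, b)],
    "Maximum": lambda a, b: [max(x, y) for x, y in zip(a, b)],
    "Relu": lambda a: [max(x, 0.0) for x in a],
    "Exp": lambda a: [math.exp(x) for x in a],
}


class EltwiseExecutor:
    """Dispatch an element-wise op to a registered kernel if supported."""

    def __init__(self, op_type: str, layout: str = "nchw"):
        self.op_type = op_type
        self.layout = layout

    def supported(self) -> bool:
        # Assumption: only the nchw layout is handled by this executor.
        return self.op_type in KERNELS and self.layout == "nchw"

    def run(self, *inputs):
        if not self.supported():
            raise NotImplementedError(
                f"{self.op_type} ({self.layout}) needs a fallback executor"
            )
        return KERNELS[self.op_type](*inputs)


result = EltwiseExecutor("Add").run([1.0, 2.0], [3.0, 4.0])
print(result)
```

Keeping the support check separate from execution lets the caller fall back to an existing reference implementation when a kernel is not available.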

Fig. 6 Execution time of various models run by different versions of OpenVINO on Lichee Pi 4A. “Master” means a direct port of OpenVINO to the RISC-V backend; “OMP + FC” means using the OpenMP library for threading and the efficient FullyConnected primitive from SHL; “Eltwise” means incorporating SHL element-wise kernels on top of all existing optimizations.

Porting OpenVINO for RVV 1.0

Version 1.0 of the RISC-V Vector extension (RVV 1.0) has been ratified by RISC-V International, meaning that vendors are expected to follow this standard when designing and implementing vector support in their own RISC-V chips. Nonetheless, existing work on RISC-V is still based on the draft RVV 0.7.1, which highlights the necessity of porting OpenVINO to RVV 1.0. During GSoC 2024, I successfully built OpenVINO for the RVV 1.0 target and benchmarked models on the Banana Pi BPI-F3. The results in Fig. 7 show that my implementation achieves an average performance boost of 1.04x and up to 1.51x on this device.

Fig. 7 Execution time of various models run by different versions of OpenVINO on Banana Pi BPI-F3. The labels are explained in the caption of Fig. 6.

It should be noted that for many models, a slight degradation in performance can be observed. This is because the element-wise executor currently only supports the nchw layout, which may lead to a significant number of Reorder operators being inserted between element-wise and other operators due to layout mismatches. These Reorder operators are relatively slow, so one of the future plans is to add support for more layouts such as nhwc in the element-wise executor to mitigate the negative impact of these extra operators.
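
To see why layout mismatches add overhead, consider a simplified model in which a Reorder is inserted between any two adjacent operators whose layouts differ. The chain and the layouts assigned to each operator below are hypothetical, and real layout selection in the CPU plugin is more involved.

```python
def insert_reorders(chain):
    """Given [(op_name, layout), ...] in execution order, return a new chain
    with a Reorder node inserted wherever adjacent layouts differ."""
    result = []
    for index, (name, layout) in enumerate(chain):
        if index > 0 and chain[index - 1][1] != layout:
            result.append(("Reorder", f"{chain[index - 1][1]}->{layout}"))
        result.append((name, layout))
    return result


# Hypothetical chain: the neighbours use nhwc, the eltwise executor needs nchw.
chain = [("Conv", "nhwc"), ("Add", "nchw"), ("Conv", "nhwc")]
with_reorders = insert_reorders(chain)
extra = sum(1 for name, _ in with_reorders if name == "Reorder")
print(with_reorders)
print("extra Reorder nodes:", extra)
```

In this toy chain, an element-wise executor that also accepted nhwc would make both Reorder nodes unnecessary.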

We also compared the performance of these models on two RISC-V devices after all optimizations. Lichee Pi 4A has 4 cores with a 12-stage out-of-order multiple-issue pipeline and supports RVV 0.7.1, while the Banana Pi BPI-F3 has 8 cores that support RVV 1.0 with an 8-stage in-order dual-issue pipeline. The results (see Fig. 8) show that, in general, performance on the Banana Pi is about twice as good as on the Lichee Pi, likely because the Banana Pi has more cores. However, four models (bert-base-ner, t2t-vit-14, mobileclip_text_encoder and whisper-decoder) achieve similar performance on both devices, and two of them (mobileclip_text_encoder and whisper-decoder) perform even better on the Lichee Pi. This discrepancy may be attributed to the lack of conv operations in these models. The conv operator is 3–4x faster on the Banana Pi, while other operations, such as MatMul and Eltwise, are slightly faster on the Lichee Pi.

Fig. 8 Comparison of various model performance on two RISC-V devices.

Adding a tutorial on cross-compiling OpenVINO for RISC-V

After completing this work, I summarized my experience, along with Sasha’s (thanks, Sasha @a-sidorova!), in cross-compiling OpenVINO for RISC-V devices into a tutorial for anyone interested. The tutorial describes the build process in detail and should be quite helpful for those looking to undertake similar tasks.

Achievement and Benefits

All of my outcomes during GSoC 2024 are listed here:

OpenVINO now provides more efficient inference of deep learning models on 64-bit RISC-V devices, with OpenMP support and optimized primitives for FullyConnected, some activation functions, and element-wise operations! To show the current status of OpenVINO on RISC-V, we compare the performance of the models in Fig. 1 again after all the optimizations (shown in Fig. 9). We can see that the performance gap is getting smaller and smaller, which should be credited to everyone working on OpenVINO for RISC-V. Thank you guys!

Fig. 9 Performance comparison of three models on Lichee Pi 4A and Raspberry Pi 4 after all the optimizations.

Moreover, I have made a small demo that runs the t2t-vit-14 classification model on Lichee Pi 4A and Banana Pi BPI-F3:

Conclusion

Participating in GSoC 2024 was an incredibly rewarding experience that allowed me to contribute to a cutting-edge AI framework while honing my skills in system-level programming and optimization. This experience broadened my understanding of low-level optimizations and AI deployment on edge devices, which I believe will be crucial in my future research. I believe the work done this summer will not only benefit the OpenVINO community but also pave the way for more efficient AI deployments on RISC-V platforms.

Acknowledgment

Throughout the project, I had the invaluable support of my mentors Dmitry Gorokhov and Alexandra Sidorova. Their guidance was instrumental in overcoming technical challenges and refining my approach. I would like to extend my heartfelt thanks to my mentors and the OpenVINO community for giving me the chance to be a part of it! Thanks also to the Institute of Software, Chinese Academy of Sciences (ISCAS) for providing the Banana Pi BPI-F3!
