Deploy AI Faster with Intel’s OpenVINO™ Model Server

Published in OpenVINO™ toolkit

Author: Paula Ramos — Intel AI Evangelist

As interest in AI skyrockets, AI developers face a pressing need for robust solutions that facilitate seamless deployment, scaling, and optimization of AI models. Enter the OpenVINO™ Model Server (OVMS), a cornerstone in AI development.

OVMS is a scalable, high-performance solution tailored for serving machine learning models optimized for Intel architectures. Crafted in C++, it’s engineered to streamline the execution of inference workloads, reducing complexity and overhead. Moreover, with OpenVINO optimizations, OVMS ensures swifter and more dependable results, empowering developers to achieve optimal performance in their AI projects.

Paula Ramos, AI Evangelist at Intel, recently spoke with software engineers Adrian Tobiszewski and Miłosz Żeglarski from the OpenVINO team to gain insights into the OpenVINO Model Server (OVMS) project and its capabilities. They discussed practical insights, real-world applications, and challenges.

What you’ll learn from the conversation:

  • How OVMS transitioned from Python to C++ to enhance performance and meet real-world AI demands
  • Insights into the Directed Acyclic Graph (DAG) scheduler for complex model pipelines and upcoming OpenAI API support
  • Examples of leveraging OVMS for tasks like object detection and generative AI
  • Future plans to enhance OVMS with new techniques for large language models, improved efficiency, and better resource utilization
  • Recommendations, demos, resources, and learning strategies for AI development

These insights are valuable for developers looking to understand practical applications and benefits of OVMS.

Read on for the full conversation:

Paula Ramos: Great to speak to you both. I am looking forward to learning more about OpenVINO Model Server. But first, what can you tell us about yourself and your role at Intel?

Milosz Zeglarski: I’ll start. I’m a software developer at Intel®, on the OpenVINO and OpenVINO Model Server team.

I joined Intel about five years ago when I was a student. At the time that I joined, OpenVINO Model Server was still under development. It was just a proof of concept managed by a team of three, but over the last couple of years it has slowly evolved into a full project with a development and evaluation team. I mostly develop for the model server, but I also do work around Kubernetes and Python.

Adrian Tobiszewski: My journey into the field started as a C++ software developer developing financial software. When I joined Intel, it was right before this sort of AI boom happened. I worked on Intel® Xeon® Phi series dedicated to high-performance computing, and switched projects when Intel acquired Nirvana with its Crest series of deep-learning accelerators. When Intel turned its focus to Habana, I started working with the OVMS team.

At the time, OVMS was getting more attention because of the plan to rewrite it in C++ and improve performance. Since I had experience with C++, it was a good opportunity for me to get back into development.

Paula Ramos: It’s great to hear the backstory. And now it seems you have a successful team and are doing an amazing job with OpenVINO Model Server. With the AI boom you mentioned, I can see the need to increase capabilities everywhere, in the cloud and at the edge. How is OpenVINO Model Server helping facilitate that and aligning with Intel’s commitment to advancing AI technologies?

Adrian Tobiszewski: OVMS really took off when a request came in from a customer. They wanted to achieve the same performance results as OpenVINO while having a serving solution. But at the time, Intel didn’t have its own serving solution for the cloud. So that’s when Intel invested more heavily in model serving, and the rest is history.

Milosz Zeglarski: Exactly, the customer came to us and said, “We are using TensorFlow Serving right now, but we want better performance on CPU with the same API we are using right now. Can you do something like that?” So, there was a discussion on whether we should contribute an OpenVINO back-end to TensorFlow Serving or whether it was worthwhile to have our own solution.

Our teammates got to work on some proofs of concept under the hood, and a working version of OpenVINO Model Server was available in four weeks. It was a really agile and quick answer to the customer’s need. And the customer kept coming back with more and more requests to add new features — and with the growing demand for OpenVINO, it became obvious that the serving part was going to follow.

Paula Ramos: Amazing. I can see a lot of the transformations that went into the project. Can you talk about the latest features and enhancements in OpenVINO Model Server?

Milosz Zeglarski: The OVMS story started with that customer request, and since then we’ve been trying to follow market trends and the requests customers bring to us.

The initial version was very small — a tiny layer of serving on top of OpenVINO. Then we added some management features for versioning and loading models from remote storage to improve the model server in Kubernetes and cloud deployments, so that it’s easier to quickly serve models.

OpenVINO Model Server then gained the ability to download models from external storage. This was the moment that Adrian mentioned earlier when we had to rewrite in C++. We ran some performance tests and concluded that we couldn’t go any further with just the Python version.

Paula Ramos: So that’s when Adrian joined the team to work in C++?

Milosz Zeglarski: Exactly. Newcomers like Adrian led the C++ implementation. I was on the team focused on maintaining the Python one while trying to prove we needed the C++ implementation.

The next important feature was KServe API support, which was also driven by the same motivation as our first customer. And in the meantime, we were also working on the directed acyclic graph scheduler.

Adrian Tobiszewski: The directed acyclic graph (DAG) scheduler enables pipelining of models. This allows us to have one model that can do, let’s say, object detection, and another model that can perform another task. The idea was to deliver a kind of engine: you can connect several models, add custom processing in between, and produce the output you want at the end without sending requests back and forth between the server and client.

Of course you could send single requests, perform the first inference, and respond back and then send another one, but this adds to the latency of the full response. We wanted to keep everything on the server side in just a single pipeline and perform more complex tasks with available models.
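To make that concrete, here is a minimal, hypothetical sketch of how a client might call such a pipeline by name, the same way it would call a single model, so every intermediate step stays on the server. It uses the ovmsclient Python package (the OVMS Python client library); the port, pipeline name, input name, and input shape below are assumptions for illustration only.

```python
import numpy as np
from ovmsclient import make_grpc_client  # pip install ovmsclient

# Connect to a running OpenVINO Model Server instance (port is an assumption).
client = make_grpc_client("localhost:9000")

# A DAG pipeline is exposed under its own name, just like a single model,
# so one request runs detection plus any downstream models entirely server-side.
image_batch = np.zeros((1, 3, 416, 416), dtype=np.float32)  # placeholder input
result = client.predict(inputs={"image": image_batch}, model_name="detect_and_classify")

# A single output comes back as a numpy array; multiple outputs as a dict of arrays.
print(type(result))
```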

Paula Ramos: Does the directed acyclic graph serve the same purposes as MediaPipe?

Adrian Tobiszewski: Back in the day, when we started developing DAG, MediaPipe wasn’t as popular. To some extent, they both solve the same problems, but usage is somewhat different. Each has slightly different limitations, but the idea with both is to keep everything in one pipeline.

Paula Ramos: So, what kind of model pipelines could OpenVINO Model Server support?

Adrian Tobiszewski: One of the most impressive ones is the Holistic demo, which detects all the joints in your hands and your facial features, and shows overlays on top of your image wherever your hands are, and so on. You can use a recorded video or even dance in front of a live camera to see how it recognizes your posture. That’s the most visually appealing one.

Paula Ramos: Great, I want to go back to the AI boom for a second because we have sort of a super AI boom right now with generative AI and large language models. What challenges is OVMS facing right now in relation to this super AI boom?

Milosz Zeglarski: The new AI boom for generative AI has happened quite recently, and we’ve been aware of it pretty much from the beginning. This relates to our latest features as well as our future plans.

Our first approach to language models was just using OVMS the way we use it for conventional AI. We were using DAGs and adding custom preprocessing and post-processing for things like tokenization and detokenization. And that worked, but the field is growing so fast that we needed something more, something better to keep up.

We decided to make it possible for the user to define Python logic, run it, and execute it on the model server side. This is a big deal because it enables pretty much everything for the user. It was also what first enabled our Stable Diffusion demo and pretty much the whole set of demos for large language models. It was built on MediaPipe, so it works as a node in a pipeline and can be connected with other nodes.

Paula Ramos: So you have nodes, and you have the flexibility to run multiple nodes, do I understand that correctly?

Milosz Zeglarski: You must have a Python script that implements a certain interface: for initializing your model, for executing what happens on every request that comes in, and for finalization. Only the execution part is required.

OpenVINO Model Server provides the data from the request to your custom Python logic, and you can do whatever you want with it. You can import Optimum, load a stable diffusion model, and just put the data through it. You can have just a little, tiny node that does some pre- and post-processing, and put a C++ node in between, and have a pipeline with Python and C++.
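As a rough illustration of that interface, here is a minimal sketch of what such a script might look like, with a class exposing initialize, execute, and finalize methods. The class name OvmsPythonModel and the pyovms.Tensor wrapper follow the OVMS Python-nodes documentation as I understand it; treat the exact names as assumptions and check the repository demos for the current interface.

```python
# servable.py: a hedged sketch of an OVMS Python node (names assumed from the docs).
from pyovms import Tensor


class OvmsPythonModel:
    def initialize(self, kwargs: dict):
        # Optional: load a model, tokenizer, or pipeline once when the servable starts.
        self.prefix = b"echo: "

    def execute(self, inputs: list) -> list:
        # Required: runs for every request; inputs is a list of pyovms.Tensor objects.
        text = bytes(inputs[0])  # raw bytes of the first input tensor
        return [Tensor("output", self.prefix + text)]

    def finalize(self):
        # Optional: release resources when the servable is unloaded.
        pass
```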

It’s just a quick implementation of one script in Python, and it works — until we have some other solutions. So I’m revealing some of our future plans.

There are new techniques for LLMs now. Late last year, vLLM came out with some state-of-the-art techniques for improved efficiency and resource utilization — which leads to better performance overall.

So OpenVINO Model Server will, in the future, have our own implementation for paged attention, for continuous batching, for different scheduling strategies, and for text generation.

We have also had a request for the OpenAI API, and this is something we plan to deliver in the near future.

A lot of our attention has been directed at these LLMs and the new, more convenient API, because the OpenAI API is definitely more convenient for language modeling. We have the KServe and TensorFlow Serving APIs, but they are very generic. They’re supposed to work with everything, and that’s not always convenient for the specialized requests of large language models.

Adrian Tobiszewski: A few users have already adopted the OpenAI API. So, similar to the KServe API, if someone would like to switch, it’s easier for them if we deliver the same API.
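For context on why API compatibility matters here: the appeal of an OpenAI-compatible endpoint is that existing client code can simply be re-pointed at the model server. A hypothetical sketch of what that could look like once such an endpoint is available, using the standard openai Python client; the base URL, port, and model name are assumptions for illustration, not documented OVMS values.

```python
from openai import OpenAI  # pip install openai

# Point the standard OpenAI client at a local, OpenAI-compatible endpoint.
# The base_url and model name are assumptions for illustration only.
client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="meta-llama/Llama-2-7b-chat-hf",
    messages=[{"role": "user", "content": "What does a model server do?"}],
)
print(response.choices[0].message.content)
```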

Paula Ramos: OVMS is an open-source system, part of the OpenVINO ecosystem. Does the project receive any external contributions, or is it just internal contributions? And how can developers actively engage with the repository if they want to make contributions?

Milosz Zeglarski: Developers can just make a fork, make a change, and create a pull request. We will review it, and, if it’s interesting, hopefully merge it.

We deal mostly with internal contributions, but there have also been contributions from external organizations.

For instance, we are collaborating with Red Hat on their OpenShift-related services and platform. The OpenVINO Model Server image is certified by Red Hat, so they are also helping us keep up with their security requirements and that kind of thing.

Paula Ramos: Before we go, I would love to get your thoughts on how developers can get started with OpenVINO Model Server, and any advice you have for those wanting to join this field.

Milosz Zeglarski: I would start with the demos. Adrian already mentioned the Holistic one. I would recommend the newest ones with Python nodes. We have a whole section on the main page of our repository where they are marked as new. We have ones for Stable Diffusion and text generation, and recently a RAG demo.

As for diving into this field in general, I’ve always gone wide, so I’d say learn about AI, deep learning, LLMs, and how things work. At the same time, obviously learn the C++ and Python languages, and the operational fundamentals that are important for deployment. Try different things, go wide, and experience different areas.

Adrian Tobiszewski: We live in a great time: you have access to so many resources that you can really succeed if you put the work into it. It takes hard work to keep up with this field, and you need to put in the hours.

Additional Resources:

OpenVINO™ Model Server GitHub repository
OpenVINO Documentation
Jupyter Notebooks
Installation and Setup

Product Page

About Paula Ramos:

Paula is an AI Evangelist at Intel. She has been an AI enthusiast and has worked in the computer vision field since the early 2000s. Follow her on Medium for more insights.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
