DEMO: Generative AI with Intel® OpenVINO™ and Red Hat OpenShift

Intel · Published in Intel Tech · Feb 1, 2024

Seamlessly inference from edge to cloud and across devices

Photo by Sharad Bhat on Unsplash

Presented by Ria Cheruvu & Dr. Paula Ramos

Generative AI is revolutionizing how organizations create high-quality text and images. However, challenges such as high inference time, unsatisfactory customer experience, and an intricate developer experience can pose obstacles to its seamless adoption. Developers are often torn between paying for expensive machines, whether on-premises servers or cloud instances, and porting workloads to an edge device with optimized models. While the cloud offers limitless compute power on demand, the edge better supports real-time processing, data sovereignty, and cost efficiency.

With a hybrid AI approach, developers don’t have to choose. By allowing developers to automatically process AI workloads using available or targeted system resources at the edge or in the cloud, hybrid AI combines the unique strengths of both. In this demo, we’ll show you how to use the OpenVINO™ toolkit and the Red Hat OpenShift Data Science (RHODS) platform to accelerate and deploy Stable Diffusion, a generative AI model, while easily switching where your model runs inference, in the cloud or at the edge, and across Intel® hardware.

About the Demo

This demo leverages OpenVINO™, an open-source toolkit for AI inference. It converts models from frameworks like TensorFlow or PyTorch into an intermediate representation (IR) format, which can then be optimized and deployed on a wide variety of hardware, including CPUs, GPUs, NPUs, and FPGAs. OpenVINO™ accelerates generative AI performance in several ways: reducing model size, reducing memory footprint, increasing inference speed, and enabling the flexibility to run workloads on both CPUs and GPUs.
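For example, converting a model to IR and compiling it for a target device takes only a few lines of Python. Here’s a minimal sketch, assuming a PyTorch ResNet-50 stands in for your own model:

import openvino as ov
import torch
from torchvision.models import resnet50

# Any PyTorch (or TensorFlow) model can be converted to OpenVINO IR.
pt_model = resnet50(weights=None).eval()
ov_model = ov.convert_model(pt_model, example_input=torch.randn(1, 3, 224, 224))

# Save the IR to disk (produces model.xml + model.bin) for later deployment...
ov.save_model(ov_model, "resnet50.xml")

# ...and compile it for a target device: "CPU", "GPU", "NPU", or "AUTO".
core = ov.Core()
compiled_model = core.compile_model(ov_model, device_name="CPU")

The same IR files can be loaded on any supported device, which is what makes the edge-to-cloud switching in this demo possible.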

To get the most out of OpenVINO™, we’ll run the demo on the RHODS platform, an AI platform offering based on the open source Open Data Hub project with tools for optimizing the complete life cycle of AI workflows on the cloud. Using RHODS to train, serve, monitor, and manage the life cycle of AI/ML models and applications, developers can spend less time managing AI infrastructure, take advantage of tested and supported AI/ML tooling, unlock flexibility across the hybrid cloud, and leverage Intel-certified drivers and plugins.

Demo: Accelerating a Stable Diffusion 2.1 model

Let’s compare the performance of a Stable Diffusion 2.1 pipeline with and without OpenVINO™ optimizations. You can find this demo and many others in the OpenVINO™ notebooks GitHub repo.

Locating your active device plugins in Red Hat OpenShift

To see which operators are running in your cluster, in the Red Hat OpenShift interface, navigate to the “Installed Operators” menu option and select “Intel Device Plugins Operator.” In this example, we’ve installed an Intel® Data Center GPU Flex 170. You can see more information about your operators by selecting the “Intel GPU Device Plugin” and “YAML” options.

A view of the Intel® Device Plugins Operator in the Red Hat OpenShift dashboard.

Accessing the OpenVINO notebook

You can access your OpenVINO instance in Red Hat OpenShift by selecting the “Routes” option under the “Networking” menu. In the “Routes” dashboard, use the search bar to find the link for your Jupyter Notebook instance, equipped with RHODS and OpenVINO.

Search in the Networking > Routes menu to locate Jupyter Notebooks equipped with RHODS.

Preparing your Stable Diffusion notebook

In a new browser tab, start your notebook server with your Stable Diffusion v2.1 pipeline. We’ve chosen the extra-large container size with 30 CPUs and added the gpu.intel.com resource so that, within an OpenVINO notebook, we can easily switch between running the model on a CPU or a GPU.

Optimizing your model

To enhance the model’s compatibility with diverse hardware configurations, optimization is essential. Let’s delve into the architecture of Stable Diffusion:

The Stable Diffusion pipeline.

When you input a text prompt, the Stable Diffusion pipeline uses three models to generate an image: a text encoder turns the text prompt into an embedding that conditions image generation, the U-Net then performs step-by-step denoising of the latent image representation, and lastly, the autoencoder decodes the latent representation and transforms it into an image. The U-Net is the most computationally expensive of the three models to run, and traditional model optimization methods like post-training 8-bit quantization don’t typically work well on it. Using a traditional Stable Diffusion pipeline, inference takes more than a few seconds:

The Stable Diffusion pipeline running on a CPU.
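For reference, the unoptimized baseline looks roughly like this sketch, assuming the Hugging Face diffusers library and the Stable Diffusion v2.1 base checkpoint:

import torch
from diffusers import StableDiffusionPipeline

# Plain PyTorch pipeline: text encoder -> U-Net denoising loop -> VAE decoder.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float32
).to("cpu")

image = pipe("a photo of an astronaut riding a horse", num_inference_steps=25).images[0]
image.save("baseline_cpu.png")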

OpenVINO™ helps optimize each of the three models in a Stable Diffusion pipeline, reducing their size and memory footprint by converting them from FP32 to FP16. With just five lines of code, you can load your model into Hugging Face Optimum, a set of Intel® and Hugging Face APIs that leverage OpenVINO™ in the back end. Running the same inference with the OpenVINO™ optimizations applied delivers slightly better results.

The Stable Diffusion pipeline running on a CPU, now with OpenVINO™ optimizations.
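A minimal sketch of that Optimum path, assuming the optimum-intel package (the notebook’s exact code may differ); export=True converts the pipeline to OpenVINO IR on the fly:

from optimum.intel import OVStableDiffusionPipeline

# Export the PyTorch checkpoint to OpenVINO IR and run it on the CPU.
model_id = "stabilityai/stable-diffusion-2-1-base"
ov_pipe = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)

image = ov_pipe("a photo of an astronaut riding a horse", num_inference_steps=25).images[0]
image.save("openvino_cpu.png")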

Switching inference to a GPU

Now, let’s compile and run the same FP32 model on our Intel® Data Center GPU Flex 170 discrete GPU (dGPU). With new features in the latest OpenVINO™ release, you don’t need to convert your model to FP16 to run it on a GPU; OpenVINO™ does it for you, allowing you to switch between CPU and GPU quickly. Using the discrete GPU, the model takes less than two seconds.

The Stable Diffusion pipeline with OpenVINO™ optimizations is now running on a GPU.
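Continuing from the previous sketch, switching devices is roughly a two-line change, assuming the same optimum-intel pipeline object:

# Retarget the pipeline to the Intel dGPU and recompile for it.
ov_pipe.to("GPU")
ov_pipe.compile()

image = ov_pipe("a photo of an astronaut riding a horse", num_inference_steps=25).images[0]
image.save("openvino_gpu.png")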

Running Stable Diffusion pipelines at the edge

While we’ve just used OpenVINO remotely to develop and deploy models on Red Hat OpenShift in a production environment, you can also install OpenVINO locally on your own machine, allowing you to run the same notebook on Intel® Arc™ A770 graphics. Just download the pre-converted Stable Diffusion model in an IR format:

Download the preconverted Stable Diffusion model in an IR format to run the same notebook using Intel® Arc™ graphics.

Next, compile the model on your GPU and add a prompt, as in the example below, and the model will generate an image on your edge device in one to two seconds: the speed we want to see in a production environment.

The Stable Diffusion pipeline is now running on an edge GPU.
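A minimal sketch of the edge flow, assuming the pre-converted IR files have already been downloaded into a hypothetical local folder and that the Arc GPU is visible to OpenVINO as "GPU":

from optimum.intel import OVStableDiffusionPipeline

# The folder already contains IR files, so no export step is needed.
pipe = OVStableDiffusionPipeline.from_pretrained(
    "./stable-diffusion-2-1-ov",  # hypothetical local path to the IR
    device="GPU",                 # the local Intel Arc A770
)

image = pipe("a cozy cabin in the woods at sunset").images[0]
image.save("edge_gpu.png")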

Get started

Taking advantage of OpenVINO™ optimizations on Intel hardware enables companies to deploy generative AI with game-changing speed and efficiency. You can also use this framework to accelerate text-to-text models such as Llama 2, which you can demo here (see the sketch after the links below). To get started, try this demo yourself, learn more about the Red Hat OpenShift platform, or check out the Intel® container playground to test your workloads on Intel hardware and explore performance benchmarking.

· Demo ready-to-run OpenVINO notebooks

· Explore the Red Hat OpenShift platform

· Test Intel hardware on our container playground
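The same Optimum Intel pattern extends to text generation. Here’s a minimal sketch, assuming the gated Llama 2 checkpoint on the Hugging Face Hub (access approval required):

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

# Export the LLM to OpenVINO IR and generate text through the OpenVINO runtime.
model_id = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = OVModelForCausalLM.from_pretrained(model_id, export=True)

inputs = tokenizer("What is hybrid AI?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))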

About the authors

Ria Cheruvu, AI Software Architect and Evangelist, Intel

Ria Cheruvu has a master’s degree in data science from Harvard University and is an instructor of data science curricula. Her pathfinding domains include solutions for security and privacy in machine learning; fairness, explainable, and responsible AI systems; uncertainty in AI; reinforcement learning; and computational models of intelligence. Ria has delivered keynotes and technical talks for Women in Data Science, QS EduData Summit, TEDx, DEF CON IoT Village, and other communities to inform on challenges and solutions in the space. She is also a published poet, children’s book author, and neuroscience enthusiast.

Paula Ramos, AI Evangelist, Intel

Paula Ramos has a PhD in computer vision, with more than 19 years of experience in the technology field. Since the early 2000s in Colombia, she has been developing novel integrated engineering technologies, mainly in the fields of computer vision, robotics, and machine learning applied to agriculture. During her PhD and postgraduate research, she deployed multiple low-cost, smart edge and IoT computing technologies that can be operated by people without expertise in computer vision, such as farmers. The central objective of Paula’s research has been to develop intelligent systems/machines that can understand and recreate the visual world around us to solve real-world needs, such as those in the agricultural industry. Currently, she’s an AI Evangelist at Intel. You can find her LinkedIn profile and her most recent talks here.
