Think, Say, See it! Chaining Multi-Modal Generative AI Models on CPUs and GPUs

From Speech Transcription to Image Generation in Seconds

OpenVINO™ toolkit
Nov 22, 2023

Author: Ria Cheruvu — AI SW Architect and Gen AI Evangelist

Credits: Paula Ramos, Raymond Lo, and the OpenVINO Notebooks team!

Generative AI models are becoming increasingly capable of complex tasks. As more efficient and responsive computing ecosystems emerge, exciting opportunities are quickly being unlocked. Chaining multiple Generative AI models together is one of them!

In this article, we’ll look at how to run multi-modal, multi-model Generative AI pipelines: four state-of-the-art models covering speech transcription, prompt refinement, image generation, and explainable AI, all chained into one pipeline that runs in seconds with OpenVINO™. OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference.

Along the way, we’ll explore the compute angle, walking through how to take advantage of CPUs and GPUs to accomplish this.

Here’s a quick sneak peek of what the final pipeline looks like!

First, let’s briefly talk about the compute ecosystem and why it matters.

Generative AI’s Compute Ecosystem

Today, there are several different options for running Generative AI workloads. At one end of the ecosystem, we have the cloud, offering (i) access to large amounts of data, (ii) virtually limitless compute on demand from powerful and efficient machines, and (iii) centralization of workloads and data.

At the other end, edge and client devices (such as laptops, mobile devices, and medical devices) are becoming a popular choice for running Generative AI workloads. These devices offer (i) real-time data processing, (ii) wider reach even without connectivity, (iii) data sovereignty, and (iv) cost efficiency.

For example, we might:

- Use the cloud for extremely large and compute-intensive models, and for data we’re comfortable uploading and storing in the cloud.

- Use the edge for light and medium-sized models and keeping data locally on our system.

But what if we could combine the strengths of these two options? Hybrid AI is a new concept that attempts to do just that.

➢ Hybrid AI is a paradigm where the processing of an AI workload can take place using available system resources and accelerators at the edge or in the cloud, without the need to manually recode our applications. This allows us to write once and deploy anywhere with ease!

Using Hybrid AI to Chain Generative AI Models

We can use Hybrid AI to offload compute-intensive workloads to our GPUs (e.g., in the cloud), and keep lightweight workloads on our CPUs (at the edge or on the client).

Let’s turn to a pipeline built on the OpenVINO™ runtime, where we chain together four Generative AI models with different modalities (hence multi-modal and multi-model).

Our objective with this pipeline is to think, say, and see it — essentially turning speech into an image in seconds!

Models:

1. Whisper for transcribing speech

2. RedPajama-INCITE-Chat (3 billion parameters, int8 precision) for refining the transcribed text

3. Stable Diffusion (versions 2.1 and XL) for using the text as a prompt for image generation

4. CLIP for saliency map generation to explore the interpretability of the generated image. We can then ask questions such as “Where is the brightest part of the image?” or “Where is the dog in the image?”, and this model will generate a saliency map, pointing to the area of the image that answers the question.

Figure 1: Chaining 4 Generative AI models with OpenVINO™.
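
To make the CLIP-based saliency step (model 4 above) more concrete, below is a rough sketch of one common recipe: score random crops of the generated image against the text query with CLIP and accumulate the similarities per pixel. It uses the Hugging Face transformers CLIP model and a hypothetical clip_saliency helper purely for illustration; the notebook's OpenVINO-optimized implementation differs in its details.

import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

def clip_saliency(image: Image.Image, query: str, n_crops: int = 200, crop_frac: float = 0.5) -> np.ndarray:
    # Hypothetical helper: approximate a saliency map by scoring random crops of the
    # image against the query and averaging the CLIP similarity over the pixels they cover.
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    w, h = image.size
    cw, ch = int(w * crop_frac), int(h * crop_frac)
    scores = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)
    rng = np.random.default_rng(0)

    for _ in range(n_crops):
        x = int(rng.integers(0, w - cw + 1))
        y = int(rng.integers(0, h - ch + 1))
        crop = image.crop((x, y, x + cw, y + ch))
        inputs = processor(text=[query], images=crop, return_tensors="pt", padding=True)
        with torch.no_grad():
            similarity = model(**inputs).logits_per_image.item()
        scores[y:y + ch, x:x + cw] += similarity
        counts[y:y + ch, x:x + cw] += 1

    return scores / np.maximum(counts, 1)  # higher values = more relevant to the query

# Example usage:
# saliency = clip_saliency(Image.open("generated.png"), "Where is the dog in the image?")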

Stable Diffusion is a notably heavier workload than the other models, particularly Stable Diffusion XL (with 3.5 billion parameters). We can use Hybrid AI to split the workload across an edge-and-cloud setup, running the Whisper, RedPajama-INCITE, and CLIP models on our two CPUs while the Stable Diffusion model runs on a GPU.

For our hardware, we use:

● Two 4th Gen Intel® Xeon® Scalable CPUs for the edge setup.

● One Intel® Data Center GPU Flex 170 for the cloud setup.

Memory requirements: 16GB RAM

We can also further optimize our Generative AI models to reduce their model size and footprint, making even the largest of models in this pipeline more lightweight.

Detailed Steps: Running Generative AI with OpenVINO™

OpenVINO™ helps us take full advantage of our compute: we can specify a target device to manually pin a workload to a particular system resource, or use the AUTO plugin to automatically select the optimal device.
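
As a minimal sketch of what that device targeting looks like (model.xml here is a hypothetical path to a model already converted to OpenVINO™ IR):

import openvino as ov

core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU'], depending on the system

# Pin the workload to a specific device...
compiled_on_cpu = core.compile_model("model.xml", device_name="CPU")
compiled_on_gpu = core.compile_model("model.xml", device_name="GPU")

# ...or let the AUTO plugin pick the optimal available device for us.
compiled_auto = core.compile_model("model.xml", device_name="AUTO")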

Let’s explore a few detailed steps on the pipeline, and what OpenVINO™ is optimizing under the hood here:

Step 1: Load the models

First, we load our Whisper and CLIP models from their sources using the OpenVINO™ Notebooks. For the RedPajama-INCITE and Stable Diffusion variants, we load the models from HuggingFace using the Optimum-Intel package (as shown in Figure 2).

Optimizations:

Using OpenVINO™, we can optimize our models by converting them to FP16 precision or further quantizing our models to int8 precision (as we can do with Whisper and RedPajama-INCITE).
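
As a rough sketch of those two options (the tiny placeholder model stands in for whichever model is being optimized, and the notebooks' exact int8 flow may differ):

import torch
import openvino as ov
import nncf  # Neural Network Compression Framework, used here for int8 weight compression

# Placeholder model and input standing in for the real model being optimized.
pytorch_model = torch.nn.Linear(8, 8)
example_input = torch.randn(1, 8)

# Convert the PyTorch model to OpenVINO's Intermediate Representation (IR).
ov_model = ov.convert_model(pytorch_model, example_input=example_input)

# Saving with compress_to_fp16=True stores the weights in FP16, roughly halving the model size.
ov.save_model(ov_model, "model_fp16.xml", compress_to_fp16=True)

# Alternatively, compress the weights to int8 with NNCF for a further size reduction.
int8_model = nncf.compress_weights(ov_model)
ov.save_model(int8_model, "model_int8.xml")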

For example, for Stable Diffusion, we convert each of its three submodels (the Text Encoder, the Latent U-Net, and the VAE Decoder) into OpenVINO™’s Intermediate Representation (IR) format at FP16 precision.

This takes the overall model size from 4.9 GB at FP32 precision down to 2.4 GB at FP16 precision! (See this video here for more details.)

You can also check out Figure 2 for the code we can use to do this for Stable Diffusion 2.1.

Figure 2: Snippet — Downloading an optimized Stable Diffusion model from HuggingFace Optimum-Intel, and converting the precision to FP16 via sd_pipe.half().
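
Since Figure 2 is an image, here is a minimal sketch of the same idea with Optimum-Intel (the stabilityai/stable-diffusion-2-1 model ID and the output directory name are assumptions):

from optimum.intel import OVStableDiffusionPipeline

# export=True converts the Hugging Face model to OpenVINO IR on the fly.
sd_pipe = OVStableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", export=True
)

# Convert the pipeline's weights to FP16 and save the optimized model locally.
sd_pipe.half()
sd_pipe.save_pretrained("stable-diffusion-2-1-fp16-ov")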

Step 2: Compile the models

Next, we compile the models. When instantiating each model, we set its device parameter to GPU or CPU; a simple model.compile() call then compiles the model and gets it ready for execution.

Figure 3: Snippet — Compiling the CLIP models of the pipeline.
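
Here is a minimal sketch of both flavors (clip_model.xml is a hypothetical IR file name, and sd_pipe is the Optimum-Intel pipeline from the earlier sketch):

import openvino as ov

core = ov.Core()

# The CLIP model (like the other edge-side models) is compiled directly for the CPU.
clip_model = core.read_model("clip_model.xml")
compiled_clip = core.compile_model(clip_model, device_name="CPU")

# The Optimum-Intel pipeline exposes .to() and .compile() instead, so we move the
# Stable Diffusion workload to the GPU and compile it ahead of the first inference.
sd_pipe.to("gpu")
sd_pipe.compile()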

Step 3: Run the models

We wrap the inference and post-processing steps into functions, then run the models in sequence in a couple of lines of code, as seen below:

Figure 4: Running the pipeline for inference.
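
Since Figure 4 is an image, here is a hypothetical sketch of that sequence; transcribe and refine_prompt stand in for the wrapper functions around Whisper and RedPajama-INCITE (the names are illustrative, not the notebook's exact API), while sd_pipe and clip_saliency come from the earlier sketches:

# Illustrative wrapper names only.
transcript = transcribe("speech.wav")             # 1. Whisper: speech -> text
prompt = refine_prompt(transcript)                # 2. RedPajama-INCITE: text -> refined prompt
image = sd_pipe(prompt).images[0]                 # 3. Stable Diffusion: prompt -> image
saliency = clip_saliency(image, "Where is the brightest part of the image?")  # 4. CLIP saliency map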

Step 4: Enjoy the result!

The results are generated in a few seconds, with Whisper taking about 0.46 s, RedPajama-INCITE about 2 s, Stable Diffusion v2.1 about 1–2 s, and CLIP about 6 s.

If we ran the entire pipeline, including Stable Diffusion, on CPU, the runtime could increase to 20–30 s. With Hybrid AI, we offloaded the Stable Diffusion v2.1 model to the GPU and brought the full pipeline’s runtime under 10 s!

You can also experiment with Stable Diffusion XL, where running this model in the pipeline on a CPU can take much longer than if we leverage our GPU.

Figure 5: Runtime of the models.

We can visualize the final results below, with the image generated by our Stable Diffusion model and the saliency map answering our query.

Figure 6: The final results!

To explore the details of these models and the OpenVINO™ implementation, check out the notebook and Gradio app here, along with the instructions to get started! Let us know your feedback!

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
