Unleashing the Power of Multimodal AI with Pix2Struct and Optimum Intel

Published in OpenVINO™ toolkit · 6 min read · May 22, 2024

Revolutionizing AI Interaction: The Power of Multimodal Systems and OpenVINO™ (Part 1)

Author: Anisha Udayakumar, AI Software Evangelist at Intel

Have you ever wondered how AI can understand a scene the same way a human does? Multimodal AI is the key — it processes visual, auditory, and textual data simultaneously to interpret its environment with remarkable depth and accuracy. In this blog series, we’ll dive into the capabilities of multimodal AI and explore how Intel’s OpenVINO™ Toolkit optimizes these complex systems for real-world applications. This first part focuses on Pix2Struct and its optimization using Optimum Intel. To learn more about LLaVA-NeXT, read the second part here.

What is Unimodal AI?

As we delve deeper into the capabilities of AI, it’s important to understand the different approaches systems can take while processing data. Traditionally, many AI systems have relied on unimodal processing, focusing on a single type of data input. While this method has its merits, it also exhibits significant limitations, especially in complex environments that mimic human interaction. In this blog, we will explore why the evolution towards multimodal AI represents a significant leap in the field.

Why Shift to Multimodal AI?

Multimodal AI changes how artificial intelligence understands and interacts with the world by mimicking human sensory and cognitive functions. These systems integrate multiple types of data (vision, audio, and text) to achieve a depth of understanding that surpasses traditional AI systems. The strength of multimodal AI lies in its ability to integrate diverse sensory data much as humans do: by processing these streams together, it can understand context more fully, make more informed decisions, and interact more naturally.

Advantages of Multimodal AI:

  • Richer Data Processing: Multimodal AI merges insights from various data types, providing a more complete understanding of complex situations.
  • Improved Accuracy: By validating data across multiple sources, these systems achieve higher accuracy and reliability.
  • Greater Robustness: Multimodal AI remains functional even when one type of input is compromised, unlike unimodal systems.

Pix2Struct Specialization and Practical Applications

Pix2Struct is a multimodal model for understanding visually situated language. It extracts information from images and is designed to excel at Document Visual Question Answering (DocVQA). It can handle documents with complex layouts and structures, such as tables and diagrams, which are challenging for traditional OCR systems. Pix2Struct not only recognizes and extracts text but also understands the context in which the text appears, allowing it to answer questions about a document’s content rather than simply producing a digital transcription. This makes Pix2Struct ideal for automating document-based workflows, information retrieval, document analysis, and summarization.

From Theory to Practice: OpenVINO™ Notebooks

Understanding the advantages of multimodal AI provides a solid theoretical foundation, but seeing these systems in action brings their potential to life. To bridge the gap between conceptual understanding and practical application, developers can explore OpenVINO™ Notebooks. These resources offer step-by-step guides that illustrate the process and demonstrate how to fully leverage the capabilities of multimodal AI alongside other popular AI models with OpenVINO™ optimization and deployment.

Pix2Struct for DocVQA Notebook:

This notebook walks through the steps of model downloading, conversion to IR, and execution, demonstrating the simplicity and power of using Optimum Intel to optimize the AI model.

Step-by-Step Implementation in OpenVINO™ Notebooks:

1. Install Prerequisites:

Begin by setting up your development environment. This involves installing the OpenVINO™ toolkit along with Optimum Intel and the other libraries and dependencies that the Pix2Struct model requires. This setup ensures that all tools are in place for model optimization and deployment, providing a unified platform for working with diverse AI applications.
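For reference, the core packages can be installed with pip. This is a minimal sketch; the notebook pins exact versions and may include additional dependencies:

pip install "optimum[openvino]" transformers pillow gradio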

2. Model Download and Loading:

This step retrieves the Pix2Struct model from the Hugging Face Hub and prepares it for optimization. We will use the pix2struct-docvqa-base model. Optimum Intel can load such models and create pipelines that run inference with OpenVINO™ Runtime through the familiar Hugging Face APIs.

3. Model Conversion to OpenVINO™ Intermediate Representation (IR):

Next, the Pix2Struct model is converted to the OpenVINO™ IR format. Optimum Intel simplifies this process by loading models directly from the Hugging Face Hub and managing both loading and conversion seamlessly, which significantly improves the model’s performance across various hardware.

Model class initialization starts with a call to the from_pretrained method. When downloading and converting a Transformers model, pass the parameter export=True. To reduce memory consumption, the model can be compressed to float16 using the half() method.

from optimum.intel.openvino import OVModelForPix2Struct

# Specify the model ID from the Hugging Face Hub
model_id = "google/pix2struct-docvqa-base"
# Download the model, convert it to OpenVINO IR, and defer compilation
ov_model = OVModelForPix2Struct.from_pretrained(model_id, export=True, compile=False)
# Optionally compress the weights to float16 to reduce memory consumption
ov_model.half()

This code downloads the Pix2Struct model directly from the Hugging Face Hub and converts it to OpenVINO™ IR in a single step.

4. Device Selection and Configuration:

Choosing the appropriate hardware device for inference is crucial for optimal performance. This involves configuring OpenVINO™ Runtime to target a specific device such as CPU, GPU, or NPU, depending on availability and application requirements.
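To see which devices are available on your machine, you can query OpenVINO™ Runtime directly; this short snippet assumes the openvino Python package is installed:

import openvino as ov

# List the inference devices OpenVINO Runtime detects on this machine
core = ov.Core()
print(core.available_devices)  # e.g. ['CPU', 'GPU']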

5. Inference Pipeline Setup:

Setting up the inference pipeline is essential for configuring the inference settings and preparing the model to run predictions effectively. This includes preprocessing the input data for Pix2Struct.

For Pix2Struct, the following steps outline the setup:

· Data Preparation: Utilize the Pix2StructProcessor for preprocessing the input data.

· Model Inference: Call the OVModelForPix2Struct.generate method to launch answer generation.

· Answer Decoding: Decode the generated answer token indices into text format using Pix2StructProcessor.decode.

from transformers import Pix2StructProcessor
# Load the processor that prepares images and questions for the model
processor = Pix2StructProcessor.from_pretrained(model_id)
# Load the saved IR (model_dir) and compile it for the device chosen
# via the notebook's device-selection widget (device.value)
ov_model = OVModelForPix2Struct.from_pretrained(model_dir, device=device.value)
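Putting the three steps together, a single question can be answered as follows. This is a minimal sketch: the image path and the question are placeholders you would replace with your own document and query.

from PIL import Image

# Open a document image (the file name here is only an example)
image = Image.open("document.png")
question = "What is the total amount due?"

# Data preparation: render the question onto the image and extract patches
inputs = processor(images=image, text=question, return_tensors="pt")

# Model inference: generate the answer token ids
answer_ids = ov_model.generate(**inputs)

# Answer decoding: convert token ids back into text
answer = processor.decode(answer_ids[0], skip_special_tokens=True)
print(answer)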

6. Execution and Results Demonstration:

Let’s see the Pix2Struct model in action. For a hands-on interactive experience, we use Gradio, a Python library that allows the creation of customizable UIs for our models. This setup enables users to directly interact with the Pix2Struct model:

The Gradio app provides an intuitive interface where users can upload images of documents to see the model perform Document Visual Question Answering (DocVQA) tasks in real time. This demonstration showcases the model’s ability to interpret complex document layouts and extract actionable insights.
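A minimal Gradio interface along these lines can reproduce that interaction. This is an illustrative sketch built on the processor and model from the previous steps, not the notebook’s exact demo code:

import gradio as gr

def answer_question(image, question):
    # Preprocess, generate, and decode with the components built above
    inputs = processor(images=image, text=question, return_tensors="pt")
    answer_ids = ov_model.generate(**inputs)
    return processor.decode(answer_ids[0], skip_special_tokens=True)

demo = gr.Interface(
    fn=answer_question,
    inputs=[gr.Image(type="pil", label="Document"), gr.Textbox(label="Question")],
    outputs=gr.Textbox(label="Answer"),
)
demo.launch()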

In this exploration of multimodal AI and Pix2Struct, we’ve seen how integrating multiple sensory data streams enhances AI efficiency and responsiveness. Pix2Struct excels in understanding and interpreting complex document layouts, providing accurate, context-aware answers.

Leveraging Intel’s OpenVINO™ Toolkit and Optimum Intel simplifies model optimization and deployment, ensuring high performance across various hardware platforms. The practical applications of Pix2Struct, such as automating document-based workflows and information retrieval, demonstrate the transformative potential of multimodal AI.

What about even more complex models like LLaVA-NeXT that aren’t part of the Optimum Intel ecosystem? Stay tuned for the second part of this series, where we’ll delve into LLaVA-NeXT’s unique capabilities and explore optimization approaches for such advanced models.

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

About the Author

Anisha Udayakumar is an AI Software Evangelist at Intel, specializing in the OpenVINO™ toolkit. At Intel, she enriches the developer community by showcasing the capabilities of OpenVINO, helping developers elevate their AI projects. With a background as an Innovation Consultant at a leading Indian IT firm, she has guided business leaders in leveraging emerging technologies for innovative solutions. Her expertise spans AI, Extended Reality, and 5G, with a particular passion for computer vision. Anisha has developed vision-based algorithmic solutions that have advanced sustainability goals for a global retail client. She is a lifelong learner and innovator, dedicated to exploring and sharing the transformative impact of technology. Connect with her on LinkedIn.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
