Mastering Multimodal AI with LLaVA-NeXT and Advanced Quantization Techniques (NNCF)

OpenVINO™ toolkit

Published in

OpenVINO-toolkit

7 min readMay 22, 2024

Revolutionizing AI Interaction: The Power of Multimodal Systems and OpenVINO™ (Part 2)

Author: Anisha Udayakumar, AI Software Evangelist at Intel

Continuing from our previous discussion on the capabilities of multimodal AI, this second part of our blog series delves into the intricate workings of LLaVA-NeXT and the advanced quantization techniques (NNCF) used to optimize these models for real-world applications. If you haven’t read the first part focusing on Pix2Struct and Optimum Intel, you can find it here.

Understanding LLaVA-NeXT

LLaVA-NeXT (Large Language and Vision Assistant — Next Generation) is a sophisticated multimodal model designed for advanced language reasoning over images. Unlike Pix2Struct, which excels in Document Visual Question Answering (DocVQA), LLaVA-NeXT combines the power of large language models (LLMs) with vision encoders like CLIP to create a general-purpose visual assistant. This model can follow both language and image instructions to complete various real-world tasks, making it suitable for creating complex multimodal chatbots.

LLaVA-NeXT introduces improved OCR capabilities and expanded world knowledge, marking a significant breakthrough in advanced language reasoning over images. Its complex structure requires detailed quantization steps for optimization using OpenVINO™ and NNCF. In this blog, we will explore the LLaVA-NeXT Multimodal Chatbot Notebook and learn how to convert and optimize the LLaVA-NeXT model to create a multimodal chatbot. Additionally, we will explore how to apply stateful transformation on the LLM part and model optimization techniques like weight compression and quantization using NNCF.

Step-by-Step Implementation in OpenVINO™ Notebooks:

1. Install Prerequisites:

Begin by setting up your development environment to use the OpenVINO™ toolkit. This involves installing the OpenVINO™ toolkit along with the necessary libraries and dependencies that support the LLaVA-NeXT model.

2. Model Download and Loading:

Designed for intricate language and vision tasks, LLaVA-NeXT requires a detailed and customized approach. It consists of several components that need individual attention during the optimization process:

· Image Encoder: Manages visual inputs, typically based on a sophisticated vision model like CLIP.

· Input Embeddings: Responsible for embedding the input text effectively.

· Language Model: Generates responses based on the integrated understanding of both visual and textual data.

For sophisticated language and vision tasks, LLaVA-NeXT integrates complex functionalities that require precise model handling. Here’s how you can load this model:

1. from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration
2. processor = LlavaNextProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
3. model = LlavaNextForConditionalGeneration.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

This code snippet properly loads the LLaVA-NeXT model, ensuring that all its components — ranging from language processing to image understanding — are loaded and ready for further conversion and optimization processes.

3. Model Conversion to OpenVINO™ IR:

The LLaVA-NeXT model requires a nuanced approach due to its complex structure. It involves individual optimization of its three main components — Image Encoder, Input Embeddings, and Language Model. OpenVINO™ supports this transition from PyTorch models by converting them into OpenVINO IR using the OpenVINO™ model conversion API. The ov.convert_model function accepts an original PyTorch model instance and an example input for tracing, returning an ov.Model that can be saved for deployment.

· Image Encoder: The Image Encoder, typically a pre-trained vision model, like CLIP, within the LLaVA-NeXT , is converted to OpenVINO™’s IR format. This step ensures that the model can efficiently handle visual inputs on platforms supported by OpenVINO™:

4. ov_image_encoder = ov.convert_model(image_encoder_model, example_input=torch.zeros((1, 5, 3, 336, 336)))
5. ov.save_model(ov_image_encoder, IMAGE_ENCODER_PATH)

· Text Embedding: Transforms text inputs into a suitable format for processing, optimized separately for handling textual data:

6. ov_input_embeddings_model = ov.convert_model(input_embedding_model, example_input=torch.ones((2, 2), dtype=torch.int64))
7. ov.save_model(ov_input_embeddings_model, INPUT_EMBEDDING_PATH)

· Language Model: The Language Model is crucial as it synthesizes processed inputs from both the Image Encoder and Input Embeddings to generate coherent and contextually relevant text responses.

To optimize performance, OpenVINO™ leverages two advanced features:

Caching Mechanism: Uses the use_cache=True parameter and past_key_values from the Transformers library to cache and reuse hidden states, reducing computational load.

Stateful Model Transformation: Transforms the model into a stateful one, internally managing cache tensors to reduce input/output overhead during inference.

8. def make_stateful(
9. ov_model: ov.Model,
10. not_kv_inputs: List[str],
11. key_value_input_names: List[str],
12. key_value_output_names: List[str],
13. batch_dim: int,
14. num_attention_heads: int,
15. num_beams_and_batch: int = None,
16. ):
17. from openvino._offline_transformations import apply_make_stateful_transformation

4. Quantization Using OpenVINO™:

Weight Compression Using NNCF:

To reduce the memory footprint and improve the inference performance of the Language Model, weight compression is applied using the Neural Network Compression Framework (NNCF). This method is highly effective for large memory-bound models like LLMs.

INT4 Compression for Language Model: Apply NNCF’s 4-bit weight compression to reduce memory consumption and improve execution speed. While this may slightly reduce prediction quality due to lower precision, it is crucial for efficient deployment of large models.

18. import nncf
19. compression_configuration = {
20.     "mode": nncf.CompressWeightsMode.INT4_SYM,
21.     "group_size": 64,
22.     "ratio": 0.6,
23. }
24. # Check if weight compression is enabled
25. if to_compress_weights.value and not LANGUAGE_MODEL_PATH_INT4.exists():
26.     ov_model = core.read_model(LANGUAGE_MODEL_PATH)
27.     ov_compressed_model = nncf.compress_weights(ov_model, **compression_configuration)
28.     core.save_model(ov_compressed_model, LANGUAGE_MODEL_PATH_INT4)

INT8 Quantization for Image Encoder: Use NNCF’s post-training quantization to optimize the Image Encoder for faster inference by reducing operational precision to 8 bits.

29. # Load the pre-quantization model
30. ov_model = core.read_model(IMAGE_ENCODER_PATH)
31. 
32. # Prepare the calibration data necessary for quantization
33. calibration_dataset = nncf.Dataset(calibration_data)
34. # Execute the quantization process
35. quantized_model = nncf.quantize(
36.     model=ov_model,
37.     calibration_dataset=calibration_dataset,
38.     model_type=nncf.ModelType.TRANSFORMER,
39.     subset_size=len(calibration_data),
40.     advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6))
41. # Save the quantized model for deployment
42. ov.save_model(quantized_model, IMAGE_ENCODER_PATH_INT8)

While INT4 compression offers greater performance improvements by reducing precision further, it can impact model accuracy more noticeably than INT8. However, a significant advantage of NNCF’s weight compression, particularly INT4, is that it is data-free, not requiring a calibration dataset, which simplifies the compression process.

5. Device Selection and Configuration:

Choosing the appropriate hardware device for inference is crucial for optimal performance. This involves configuring OpenVINO™ runtime to utilize specific devices like CPU, GPU, or NPU, depending on availability and application requirements.

6. Inference Pipeline Setup:

Setting up the inference pipeline is essential for configuring the inference settings and preparing the models to run predictions effectively. This process utilizes the OVLlavaForCausalLM class for generating contextually relevant responses.

5. Execution and Results Demonstration:

Let’s see LLaVA-NeXT in action. This application allows users to input both text and images to interact with the multimodal chatbot. The Gradio interface facilitates testing on how LLaVA-NeXT processes and responds to combined inputs, showcasing its capability to handle sophisticated language and vision tasks effectively.

Multimodal AI, empowered by advanced quantization techniques, brings a new level of efficiency and responsiveness to AI systems. The LLaVA-NeXT model, with its intricate handling of language and vision tasks, demonstrates the potential of these systems to create more natural and intuitive interactions. As technology continues to evolve, the integration of these advanced techniques ensures that AI systems can meet the growing demands of real-world applications, providing robust and reliable performance.

By leveraging the power of OpenVINO™ and advanced quantization methods, we can deploy these complex AI systems effectively, ensuring that they adapt to our needs and enhance our interactions with technology. LLaVA-NeXT can be used in retail for intelligent shopping assistants and visual search tasks, in healthcare for medical imaging analysis and interpreting complex documents, and in customer service for handling inquiries involving both textual and visual information. Stay tuned as we explore further advancements in AI technology and delve deeper into the capabilities of multimodal AI systems.

Additional Resources

OpenVINO™ Documentation

OpenVINO™ Notebooks

Provide Feedback & Report Issues

About the Author

Anisha Udayakumar is an AI Software Evangelist at Intel, specializing in the OpenVINO™ toolkit. At Intel, she enriches the developer community by showcasing the capabilities of OpenVINO, helping developers elevate their AI projects. With a background as an Innovation Consultant at a leading Indian IT firm, she has guided business leaders in leveraging emerging technologies for innovative solutions. Her expertise spans AI, Extended Reality, and 5G, with a particular passion for computer vision. Anisha has developed vision-based algorithmic solutions that have advanced sustainability goals for a global retail client. She is a lifelong learner and innovator, dedicated to exploring and sharing the transformative impact of technology. Connect with her on LinkedIn.

Notices & Disclaimers

Intel technologies may require enabled hardware, software, or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.