Multi-image Phi-3-vision test

alex buzunov · Published in CodeX · 5 min read · Jun 12, 2024

Let’s use an AI-driven chat application that leverages the Phi-3 Vision model for multi-modal deep learning and creative image synthesis.

By seamlessly integrating natural language processing (NLP) and computer vision techniques, we will explore the fascinating realm of generative AI, where textual descriptions and visual elements converge to produce imaginative and highly contextual outputs.

Architecture

The script adheres to a modular architecture that keeps the AI-powered chat functionality organized. The core components, sketched in code after the list, include:

  1. MyFrame: The central part of the application, housing the workspace and menu bar, facilitating smooth navigation and user interaction.
  2. Workspace: A dynamic notebook structure that accommodates multiple workspace panels, each representing a distinct AI model or vendor, enabling seamless switching between diverse generative capabilities.
  3. WorkspacePanel: A comprehensive panel within the workspace, encompassing a vendor notebook, chat input panel, and log panel, providing a unified interface for user input, model interaction, and real-time feedback.
  4. VendorNotebook: A customizable notebook that hosts vendor-specific chat display panels, allowing users to engage with a wide array of generative models and experiment with their unique features.
  5. ChatDisplayNotebookPanel: An interactive panel within the vendor notebook that presents the chat history and supports the addition of new chat tabs, facilitating organized and context-aware conversations with the AI models.
  6. ChatInputPanel: A dedicated panel for user input and direct interaction with the AI model, enabling dynamic prompts, parameter adjustments, and fine-grained control over the generative process.
  7. LogPanel: A real-time monitoring panel that displays log messages and model output, providing valuable insights into the AI’s thought process and assisting in debugging and optimization.
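
A minimal structural sketch of this hierarchy, assuming wxPython (which the conclusion confirms the script uses); the class names follow the list above, but the bodies are simplified placeholders rather than the actual implementation:

# Simplified wxPython skeleton of the component hierarchy described above;
# panel internals are placeholders, not the script's real implementation.
import wx

class LogPanel(wx.Panel):
    """Real-time log display backed by a read-only text control."""
    def __init__(self, parent):
        super().__init__(parent)
        self.log = wx.TextCtrl(self, style=wx.TE_MULTILINE | wx.TE_READONLY)
        sizer = wx.BoxSizer(wx.VERTICAL)
        sizer.Add(self.log, 1, wx.EXPAND)
        self.SetSizer(sizer)

class WorkspacePanel(wx.Panel):
    """Hosts the vendor notebook, chat input, and log panel."""
    def __init__(self, parent):
        super().__init__(parent)
        self.vendor_notebook = wx.Notebook(self)  # stands in for VendorNotebook
        self.log_panel = LogPanel(self)
        sizer = wx.BoxSizer(wx.VERTICAL)
        sizer.Add(self.vendor_notebook, 3, wx.EXPAND)
        sizer.Add(self.log_panel, 1, wx.EXPAND)
        self.SetSizer(sizer)

class MyFrame(wx.Frame):
    """Top-level frame housing the workspace notebook and menu bar."""
    def __init__(self):
        super().__init__(None, title="Phi-3 Vision Chat")
        workspace = wx.Notebook(self)  # stands in for Workspace
        workspace.AddPage(WorkspacePanel(workspace), "Phi-3 Vision")
        self.SetMenuBar(wx.MenuBar())

if __name__ == "__main__":
    app = wx.App()
    MyFrame().Show()
    app.MainLoop()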

Multi-Modal Deep Learning and Creative Image Synthesis

The script’s multi-image mode unlocks the true potential of multi-modal deep learning by allowing users to seamlessly blend textual descriptions with multiple visual inputs. By harnessing the power of the Phi-3 Vision model, a state-of-the-art generative AI architecture, the application enables the synthesis of highly creative and contextually relevant images that merge the essence of the provided descriptions and visual elements.
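
Under the hood, Phi-3 Vision addresses multiple images through numbered <|image_i|> placeholders in the user turn, one per image passed to the processor. Here is a sketch of how such a multi-image prompt might be assembled; the file names are illustrative, not from the original script:

# Illustrative multi-image prompt for Phi-3 Vision: one numbered
# placeholder per image, in the order the images are supplied.
from PIL import Image

image_paths = ["img_1.png", "img_2.png", "img_3.png", "img_4.png"]  # hypothetical paths
images = [Image.open(path) for path in image_paths]

placeholders = "".join(f"<|image_{i}|>\n" for i in range(1, len(images) + 1))
chatHistory = [{
    "role": "user",
    "content": placeholders + "Creatively mix the descriptions of these 4 images.",
}]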

In this example, it lets you mix the descriptions of 4 images:

Let’s dive into the core functionality that drives this feature:


# Build the prompt from the chat history, then tokenize it with the images
prompt = processor.tokenizer.apply_chat_template(
    chatHistory, tokenize=False, add_generation_prompt=True
)
inputs = processor(prompt, images, return_tensors="pt").to("cuda:0")

generation_args = {
    "max_new_tokens": max_new_tokens,
    "temperature": chat.temp_val,  # note: ignored while do_sample is False
    "do_sample": False,
}

generate_ids = model.generate(
    **inputs,
    eos_token_id=processor.tokenizer.eos_token_id,
    **generation_args,
)
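
For context, here is a hedged sketch of the setup and decoding that would surround that snippet, continuing from the variables above and assuming the public microsoft/Phi-3-vision-128k-instruct checkpoint on Hugging Face:

# Assumed setup around the snippet above (not shown in the article):
# load the public Phi-3 Vision checkpoint and decode only the new tokens.
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cuda:0",
    torch_dtype=torch.float16,
    trust_remote_code=True,  # the checkpoint ships custom model code
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# After generate(), strip the prompt tokens and decode only the response.
new_tokens = generate_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]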

The AskQuestion method in the Microsoft_Copilot_InputPanel class orchestrates the multi-modal deep learning process. It retrieves the selected images and user question from the chat display panel, constructing a rich prompt that encapsulates the desired context and visual elements.

The prompt, along with the image paths, is then passed to the stream_response method, which spawns a separate thread to run the Phi-3 Vision model's generative inference. By attending jointly over the text and image tokens, the transformer-based model synthesizes a creative output that blends the textual descriptions and visual elements into a unique and imaginative result.
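
A condensed sketch of that flow, written as a free function; the signature and helper names here are hypothetical stand-ins for the panel's actual methods:

# Hedged sketch of the AskQuestion flow; names are illustrative.
def ask_question(chat_history, question, image_paths, stream_response):
    # Phi-3 Vision expects one numbered placeholder per attached image.
    placeholders = "".join(
        f"<|image_{i}|>\n" for i in range(1, len(image_paths) + 1)
    )
    chat_history.append({"role": "user", "content": placeholders + question})
    # stream_response is expected to run inference on a worker thread
    # so the wx event loop stays responsive.
    stream_response(chat_history, image_paths)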

Real-Time Inference and Response Streaming

To ensure a seamless and interactive user experience, the script employs the VisionResponseStreamer class, which facilitates real-time inference and response streaming from the Phi-3 Vision model. As the model generates the creative output in chunks, the stream_response method dynamically updates the chat display panel, providing users with a captivating glimpse into the generative process.
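
The article does not show VisionResponseStreamer's internals, but transformers ships a TextIteratorStreamer that supports exactly this pattern; here is a minimal sketch, under the assumption that the class wraps something similar:

# Minimal streaming sketch using transformers' TextIteratorStreamer,
# assuming VisionResponseStreamer wraps a similar mechanism.
from threading import Thread
from transformers import TextIteratorStreamer

def stream_response(model, processor, inputs, on_chunk, max_new_tokens=500):
    streamer = TextIteratorStreamer(
        processor.tokenizer, skip_prompt=True, skip_special_tokens=True
    )
    generation_kwargs = dict(
        **inputs,
        streamer=streamer,
        max_new_tokens=max_new_tokens,
        eos_token_id=processor.tokenizer.eos_token_id,
        do_sample=False,
    )
    # generate() blocks, so run it on a worker thread and consume the
    # streamer incrementally from the caller's side.
    Thread(target=model.generate, kwargs=generation_kwargs, daemon=True).start()
    for chunk in streamer:
        on_chunk(chunk)  # e.g. append the chunk to the chat display panel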

Example using 3 images:

Logging and Performance Monitoring

The script incorporates comprehensive logging functionality through the LogPanel class, enabling users to monitor the application's performance, track model behavior, and identify potential bottlenecks or errors. By capturing log messages and model output in real-time, developers can efficiently debug, optimize, and fine-tune the multi-modal deep learning pipeline, ensuring optimal results and a smooth user experience.
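
A common way to feed such a panel is a logging.Handler that marshals each record onto the GUI thread with wx.CallAfter; a sketch under that assumption (the handler class is illustrative, not from the script):

# Illustrative handler that routes Python logging into a wx text control.
import logging
import wx

class WxLogHandler(logging.Handler):
    def __init__(self, text_ctrl):
        super().__init__()
        self.text_ctrl = text_ctrl

    def emit(self, record):
        msg = self.format(record)
        # wx widgets must be touched from the GUI thread; CallAfter
        # queues the append safely from any worker thread.
        wx.CallAfter(self.text_ctrl.AppendText, msg + "\n")

# usage: logging.getLogger().addHandler(WxLogHandler(log_panel.log))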

Example using 10 images:

It seems to describe only the first image by default, even though it's given more than one:

When asked to describe the first image:

When asked to describe both "in order", it places the first on the left and the second on the right, and then connects them by theme:

Asking for a mixed description yields what looks like two mostly separate descriptions:

Now, asking for a creative spin yields a coherent scene description, but with no reference to the second image. The prompt used:

creatively mix 2 image descriptions. Provide a detailed and artistically rich description of the image provided by the user. Focus on capturing the essence, mood, and atmosphere. Highlight intricate details, colors, textures, and emotions. Use evocative language to paint a vivid picture that brings the image to life in the reader's mind.

It seems unable to reference images by number. Let's try a new test.

It has a hard time identifying where number one is. It finds "2", but you cannot ask "which image contains number one"; you can only ask "which number is in the first image" if you want to identify "1".

Asking where number 3 is proves too indirect; the model starts to hallucinate (there is no 1 or 4 in the image):

Conclusion

The provided wxPython script exemplifies the potential of multi-modal deep learning for creative image description and synthesis. By leveraging the Phi-3 Vision model's ability to blend textual descriptions with multiple visual inputs, the application lets users explore generative AI hands-on. The tests above also surfaced clear limitations: the model tends to describe only the first image unless prompted otherwise, struggles to reference images by number, and can hallucinate when a question is too indirect. Even so, we were able to coax vivid, highly contextual descriptions that merge several images into one coherent scene. As generative AI continues to evolve, this script offers a practical starting point for developers and enthusiasts who want to experiment with multi-image prompting.
