Chat-Based Image Analysis Using Google Gemini Vision and wxPython

Jun 24, 2024

In today’s digital landscape, advanced image analysis tools are crucial for developers and data scientists. One such integration involves Google Gemini Vision and a wxPython user interface, creating a chat-based image analysis tool.

Google Gemini Vision leverages AI-driven models to interpret and describe visual content. When combined with wxPython, a versatile GUI toolkit for Python, the result is an interactive application that facilitates detailed image analysis through a chat-based system.

In this article we use the vision capabilities of Google Gemini models through the Vertex AI API. Check here if you are looking for a Generative AI Gemini example that takes no parameters and keeps the safety guards on by default.

VisionResponseStreamer

The VisionResponseStreamer class handles real-time streaming responses from an AI-driven image analysis model served through Google Vertex AI. Below is a breakdown of its components and functionality.

Initialization

class VisionResponseStreamer:
    def __init__(self):
        # Placeholders for later use.
        self.model = {}
        self.tokenizer = {}

The constructor initializes the class with placeholders for the model and tokenizer, setting up the necessary structure for later use.

Streaming Response Method

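def stream_response(self, text_prompt, chatHistory, receiving_tab_id, image_path):
    # apc, fmt and pub come from elsewhere in the source file (not shown here).
    out = []  # output list
    from os.path import isfile
    chat = apc.chats[receiving_tab_id]  # chat configuration for the receiving tab
    # Format a header with the question and model name, then send it to the chat.
    header = fmt([[f'{text_prompt}Answer:\n']], ['Question | ' + chat.model])
    pub.sendMessage('chat_output', message=f'{header}\n', tab_id=receiving_tab_id)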

The method stream_response processes a text prompt and image path to generate and stream AI responses. It initializes an empty list out for output, fetches the chat configuration, and sends a formatted header to the chat.
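The pub.sendMessage('chat_output', ...) call suggests a publish/subscribe pattern: the streamer publishes text and each chat tab decides whether the message is meant for it. The excerpt does not show where pub is imported from, so the sketch below uses a hypothetical MiniBus class built on the standard library as a stand-in. ChatTab and its chunks list are also illustrative names, not part of the original code.

```python
from collections import defaultdict


class MiniBus:
    """Tiny in-process publish/subscribe bus (hypothetical stand-in for pub)."""

    def __init__(self):
        self._listeners = defaultdict(list)

    def subscribe(self, listener, topic):
        # Register a callable to be invoked for every message on this topic.
        self._listeners[topic].append(listener)

    def send_message(self, topic, **kwargs):
        # Deliver the keyword arguments to every listener of the topic.
        for listener in self._listeners[topic]:
            listener(**kwargs)


class ChatTab:
    """Collects streamed text for one tab; a real tab would update a wx control."""

    def __init__(self, bus, tab_id):
        self.tab_id = tab_id
        self.chunks = []
        bus.subscribe(self.on_chat_output, 'chat_output')

    def on_chat_output(self, message, tab_id):
        # Every tab hears every message, so filter on the target tab id.
        if tab_id == self.tab_id:
            self.chunks.append(message)


bus = MiniBus()
tab_a, tab_b = ChatTab(bus, 'a'), ChatTab(bus, 'b')
bus.send_message('chat_output', message='Question | demo\n', tab_id='a')
bus.send_message('chat_output', message='first chunk ', tab_id='a')
```

Routing by tab_id keeps the streamer unaware of the UI: it only needs the id of the tab that asked the question.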

Model Initialization and Configuration

    try:
        import vertexai
        from vertexai import generative_models
        from vertexai.generative_models import Part, Image as VertexAIImage
        from io import BytesIO
        from PIL import Image as PILImage

        PROJECT_ID = "your-project-id"
        LOCATION = "us-central1"
        vertexai.init(project=PROJECT_ID, location=LOCATION)

        model = generative_models.GenerativeModel(model_name=chat.model)
        generation_config = generative_models.GenerationConfig(
            max_output_tokens=int(chat.max_tokens),
            temperature=float(chat.temperature),
            top_p=chat.top_p,
            top_k=chat.top_k,
        )

This block imports necessary modules and initializes the Vertex AI client with the specified project and location. It configures the model with parameters like maximum output tokens, temperature, top_p, and top_k to control the generation process.
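Because these parameters come from UI fields, they may arrive as strings. The following is a minimal sketch of coercing and range-checking them before they reach GenerationConfig. The function name build_generation_kwargs and the specific bounds are assumptions made for illustration; the limits a given model accepts may differ.

```python
def build_generation_kwargs(max_tokens, temperature, top_p, top_k):
    """Coerce UI field values (often strings) into typed, range-checked values."""
    kwargs = {
        'max_output_tokens': int(max_tokens),
        'temperature': float(temperature),
        'top_p': float(top_p),
        'top_k': int(top_k),
    }
    # The bounds below are illustrative assumptions, not the service's limits.
    if kwargs['max_output_tokens'] < 1:
        raise ValueError('max_tokens must be at least 1')
    if kwargs['temperature'] < 0.0:
        raise ValueError('temperature must not be negative')
    if not 0.0 <= kwargs['top_p'] <= 1.0:
        raise ValueError('top_p must be between 0 and 1')
    if kwargs['top_k'] < 1:
        raise ValueError('top_k must be at least 1')
    return kwargs


params = build_generation_kwargs('2048', '0.7', '0.95', '40')
# The keys match GenerationConfig's keyword arguments, so the result can be
# unpacked as: generative_models.GenerationConfig(**params)
```

Validating early means a mistyped value fails with a clear message in the UI rather than as an API error in the middle of a request.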

Safety Configuration and Image Processing

        safety_config = [
            generative_models.SafetySetting(category=generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT, threshold=generative_models.HarmBlockThreshold.BLOCK_NONE),
            generative_models.SafetySetting(category=generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT, threshold=generative_models.HarmBlockThreshold.BLOCK_NONE),
            generative_models.SafetySetting(category=generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT, threshold=generative_models.HarmBlockThreshold.BLOCK_NONE),
            generative_models.SafetySetting(category=generative_models.HarmCategory.HARM_CATEGORY_UNSPECIFIED, threshold=generative_models.HarmBlockThreshold.BLOCK_NONE),
            generative_models.SafetySetting(category=generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH, threshold=generative_models.HarmBlockThreshold.BLOCK_NONE),
        ]

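The safety_config list sets every listed harm category to BLOCK_NONE, so the model's responses are not blocked by the safety filters for those categories. Raise the thresholds if the application needs stricter filtering.

This class combines AI-driven image analysis with real-time interaction, using Google Vertex AI to provide detailed image descriptions through a chat-based interface.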
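The image-processing step named in the section heading is not part of the excerpt. The following is a minimal sketch of how an image file might be turned into a request part and streamed. It assumes the file's raw bytes are sent as-is (the original imports PIL and BytesIO, so it may convert the image first). The helper names load_image_payload and stream_description are hypothetical.

```python
import mimetypes
from pathlib import Path


def load_image_payload(image_path):
    """Return (bytes, mime_type) for an image file, or raise if it is unusable."""
    path = Path(image_path)
    if not path.is_file():
        raise FileNotFoundError(f'Image not found: {image_path}')
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith('image/'):
        raise ValueError(f'Not a recognized image type: {image_path}')
    return path.read_bytes(), mime_type


def stream_description(model, text_prompt, image_path, generation_config, safety_config):
    """Yield text chunks for one image; needs the Vertex AI SDK and credentials."""
    from vertexai.generative_models import Part  # imported lazily, only when called

    data, mime_type = load_image_payload(image_path)
    image_part = Part.from_data(data=data, mime_type=mime_type)
    responses = model.generate_content(
        [image_part, text_prompt],
        generation_config=generation_config,
        safety_settings=safety_config,
        stream=True,
    )
    for response in responses:
        # Accessing .text can raise if a chunk carries no text (for example,
        # when a response is blocked); a real caller should handle that case.
        yield response.text
```

Each yielded chunk could then be published to the receiving tab, in the same way the header is sent with pub.sendMessage.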

Models used

The Google Gemini Vision platform offers a suite of advanced AI-driven image analysis models, each designed to cater to different levels of complexity and application requirements. The available models in the model_list include:

models/gemini-1.5-flash: The lighter, faster member of the 1.5 family, suited to quick, high-volume image descriptions.
models/gemini-1.5-flash-001: A fixed, versioned release of 1.5-flash; pinning it keeps behavior from changing when the unversioned name moves to a newer release.
models/gemini-1.5-pro: The larger, more capable 1.5 model, aimed at more complex image analysis tasks at some cost in speed.
models/gemini-1.5-pro-001: A fixed, versioned release of 1.5-pro.

Screenshots

One Image

The screenshot displays an interface of the Google VertexAI Gemini Vision tool, showcasing a chat-based image analysis feature. The left panel contains the user’s query and the AI’s detailed, artistic description of the image, while the right panel displays the analyzed image.

Interface Elements

Query Panel: Located on the left, it shows the user’s request for a detailed and artistic description of the image.
AI Response: Below the query, the AI provides a rich, evocative narrative of the image, capturing its essence, mood, and intricate details.
Image Display: The right panel displays the image being described by the AI, providing a visual reference that complements the textual description.
Controls and Parameters: At the bottom, various control buttons (Paste, Hist, Sys, Rand, Ask) and parameters (max_tokens, temperature, top_k, top_p) allow the user to fine-tune the AI’s output and interact with the tool.

10 images/Image fusion

The screenshot showcases the Google VertexAI Gemini Vision tool in action, performing image fusion to provide a unified, creatively mixed description for multiple images. The left panel displays the user’s query and the AI’s detailed, imaginative description. The right panel shows one of the images, featuring three black and white cats dressed in traditional Ukrainian attire, sitting on a wooden bench.
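One way such a multi-image request can be assembled is to place all image parts and the text prompt into a single contents list, so the model sees every image in one call. The following is a minimal sketch under that assumption. The function name build_fusion_contents and the max_images cap are hypothetical, and plain strings stand in for the SDK's image parts.

```python
def build_fusion_contents(image_parts, text_prompt, max_images=10):
    """Return one request payload: all image parts first, then the prompt."""
    if not image_parts:
        raise ValueError('at least one image is required')
    if len(image_parts) > max_images:
        raise ValueError(f'expected at most {max_images} images')
    return [*image_parts, text_prompt]


# Strings stand in for the SDK's image Part objects in this demo.
contents = build_fusion_contents(
    ['<img 1>', '<img 2>'],
    'Describe all images as one scene.',
)
```

Sending everything in one request is what lets the model blend the images into a single description instead of answering per image.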

User Query

The user requests a detailed, creative mix of all image descriptions as one, without splitting them, aiming to be as creative and weird as possible.

AI Response

The AI generates a surreal and imaginative narrative that fuses elements from all the images into a cohesive, fantastical story:

Surreal Dreamscape: A woman adorned with yellow and blue roses stands against a vibrant tapestry of flowers, her back to the viewer, seamlessly blending with the background.
War-torn Landscape: A mother cradles her sleeping child amidst falling bombs and a fiery sky.
Chilling Figure of Putin: Vladimir Putin appears with his followers under a menacing, lightning-lit sky.
Ukrainian Spirit: A woman with wings emerges from a starry void, her silhouette painted with the Ukrainian flag colors.
Duality of Women: Two women, dressed in yellow and blue, dance against a cosmic backdrop, symbolizing hope and beauty amidst adversity.
Majestic Cats: Black and white cats in traditional Ukrainian attire sit on a wooden bench, their eyes reflecting wisdom and resilience.
Field of Flowers: A woman crowned with flowers stands in a vibrant field of blooms.
Female Soldier: A soldier, face smudged with dirt, gazes unwaveringly, ready to defend herself with a pistol in hand.
Final Tapestry: This combined narrative captures the essence of all images, creating a rich, imaginative fusion that conveys both the beauty and the complexities of the Ukrainian spirit.

Conclusion

Integrating Google Vertex AI with a wxPython UI provides a powerful tool for generating detailed image descriptions, enhancing projects with rich, AI-generated insights. This article highlights its implementation, showcasing the future of intelligent image analysis.

Source

wxchat/google_vertexai_vision.py at google_vertexai_vision · myaichat/wxchat
