Gemini-1.5-Flash & Pro-Vision-001: GenAI with 20/20 Vision

Gemini Multimodality: GenAI Decodes More Than Just Pixels!

Published in Google Cloud - Community · 6 min read · Jul 12, 2024

In the age of information overload, visuals — from charts to infographics — have become our go-to for quick, digestible insights. But what if our machines could not only ‘see’ these visuals but truly ‘understand’ them? Enter the world of Gemini-Flash and Gemini-Pro-Vision, two AI models poised to revolutionize how we interact with graphical data.

We will dive into some fascinating graphs about the Cricket 🏏 World Cup 🏆, Fruit Sugar Fights 🍎🍌, Battery Blues 🔋, Chip Bag Mysteries 🥔, TV Show Characters 📺🎬🎭, and Olympic Host Cities 🏅🏟️.

Google Cloud Vertex AI Tech (‘Multimodal’ section)

Gemini-Flash and Gemini-Pro-Vision: Seeing the Unseen

Gemini-Flash and Gemini-Pro-Vision aren’t just image recognition or captioning tools; they’re visual interpreters. They combine cutting-edge computer vision with the analytical power of large language models (LLMs). This means they don’t just identify objects in an image; they understand the context, relationships, and even subtle implications hidden within the visual data.

How do LLMs “read” graphical visuals?

Here is a brief, simplified look at the process:

Pixel Power: The model begins by analyzing the raw pixels of an image, much like our eyes detect light.
Object Recognition: It then identifies objects within the image—a pie chart, a bar graph, a scatter plot, or even details of the characters in a sitcom scene.
Data Extraction: For graphs, the model extracts numerical data points, axis labels, and any accompanying text.
Contextual Understanding: This is where the magic happens. The LLM taps into its vast knowledge base, acquired through training on massive datasets (and, where an application adds it, supplemented by Retrieval Augmented Generation (RAG)). It interprets the extracted data in context, compares it to known patterns, and applies logical reasoning.
Insightful Narrative: Finally, the model generates a detailed, human-readable analysis of the visual. It can summarize trends, highlight key points, compare data across different graphs, and even make predictions or draw inferences.
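The data-extraction step becomes easier to check if you ask the model to return the chart's contents as JSON and then validate the reply. The sketch below is a minimal illustration of that idea, not the models' internal mechanism. The JSON schema, the function names, and the sample reply (placeholder values, not real chart data) are all assumptions made for this example.

```python
import json

FENCE = "`" * 3  # Markdown code fence that models often wrap JSON in


def strip_fence(text: str) -> str:
    """Remove a surrounding Markdown code fence, if present."""
    text = text.strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        lines = text.splitlines()
        text = "\n".join(lines[1:-1])  # drop the opening and closing fence lines
    return text


def parse_chart_reply(reply_text: str) -> dict:
    """Parse and validate a reply in the assumed schema:
    {"title": str, "x_label": str, "y_label": str,
     "points": [{"label": str, "value": number}, ...]}
    """
    data = json.loads(strip_fence(reply_text))
    for key in ("title", "x_label", "y_label", "points"):
        if key not in data:
            raise ValueError(f"missing key: {key}")
    for point in data["points"]:
        if not isinstance(point.get("value"), (int, float)):
            raise ValueError(f"non-numeric value in point: {point}")
    return data


def highest_point(chart: dict) -> str:
    """Return the label of the data point with the largest value."""
    return max(chart["points"], key=lambda p: p["value"])["label"]


# Hypothetical reply with placeholder values (not taken from any real chart).
sample_reply = FENCE + "json\n" + json.dumps({
    "title": "Example chart",
    "x_label": "Category",
    "y_label": "Amount",
    "points": [{"label": "A", "value": 3}, {"label": "B", "value": 7.5}],
}) + "\n" + FENCE

chart = parse_chart_reply(sample_reply)
print(highest_point(chart))  # B
```

Validating the reply this way catches missing fields or non-numeric values before any downstream analysis relies on them.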

A Tour Through Visual Stories

We use a simple prompt: ‘Explain each part of the following image in detail in points with all the metrics and related observations (if necessary).’ The prompt is 25 tokens long, and a sample image adds 258 tokens, for a total of 283 tokens. That is far below the 1 million input token limit of Gemini 1.5 Flash (the older Pro-Vision model has a considerably smaller context window, though still ample for a single image). The 1M context window becomes important for larger inputs such as audio and video.
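The token arithmetic above can be written out as a short sketch. The three constants come from this post. The function names are hypothetical, and the sketch assumes every image costs the same 258 tokens as the sample image, which may not hold for all images or models. It also looks only at the token budget and ignores any separate per-request limits the service may apply.

```python
PROMPT_TOKENS = 25             # token count of the prompt, as stated in the post
IMAGE_TOKENS = 258             # token count of one sample image, as stated in the post
INPUT_TOKEN_LIMIT = 1_000_000  # 1M input token limit cited in the post


def request_tokens(num_images: int = 1) -> int:
    """Total input tokens for the prompt plus num_images images."""
    return PROMPT_TOKENS + num_images * IMAGE_TOKENS


def max_images_by_tokens() -> int:
    """How many such images fit alongside the prompt, by token budget alone."""
    return (INPUT_TOKEN_LIMIT - PROMPT_TOKENS) // IMAGE_TOKENS


print(request_tokens(1))                               # 283
print(f"{request_tokens(1) / INPUT_TOKEN_LIMIT:.4%}")  # 0.0283%
```

A single image with this prompt uses only a tiny fraction of the window, which is why the large context matters mainly for heavier inputs.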

Now all you need is an image! Every graphic in this blog is smaller than 7 MB, so it can be pasted directly into the console or uploaded from a local machine. Files larger than 7 MB (typically audio, video, or large images/PDFs) must first be uploaded to a Google Cloud Storage (GCS) bucket and then referenced from there.
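In code, the same size rule might look like the sketch below. It is only an illustration under stated assumptions, not the exact code from the Colab: the helper names (`choose_route`, `guess_mime_type`, `explain_image`) are hypothetical, "7 MB" is read as 7 MiB, a file exactly at the limit is treated as inline, and the model ID `gemini-1.5-flash-001` is an assumed choice. The SDK calls (`Part.from_data`, `Part.from_uri`, `GenerativeModel.generate_content`) belong to the Vertex AI Python SDK, which has to be installed and initialized with `vertexai.init(...)` before `explain_image` is called.

```python
import mimetypes
from pathlib import Path
from typing import Optional

INLINE_LIMIT_BYTES = 7 * 1024 * 1024  # assumption: the post's "7 MB" read as 7 MiB
PROMPT = ("Explain each part of the following image in detail in points "
          "with all the metrics and related observations (if necessary).")


def choose_route(size_bytes: int) -> str:
    """Return "inline" for payloads within the limit, "gcs" for larger ones."""
    return "inline" if size_bytes <= INLINE_LIMIT_BYTES else "gcs"


def guess_mime_type(filename: str) -> str:
    """Guess a MIME type from the file extension, defaulting to PNG."""
    return mimetypes.guess_type(filename)[0] or "image/png"


def explain_image(local_path: str, gcs_uri: Optional[str] = None,
                  model_name: str = "gemini-1.5-flash-001") -> str:
    """Send one image plus the post's prompt to a Gemini model on Vertex AI.

    Requires the google-cloud-aiplatform package and a prior vertexai.init(...).
    """
    # Imported lazily so the helpers above work without the SDK installed.
    from vertexai.generative_models import GenerativeModel, Part

    path = Path(local_path)
    mime_type = guess_mime_type(path.name)
    if choose_route(path.stat().st_size) == "inline":
        # Small file: send the raw bytes with the request.
        image_part = Part.from_data(data=path.read_bytes(), mime_type=mime_type)
    else:
        # Large file: it must already be in a GCS bucket; pass its gs:// URI.
        if gcs_uri is None:
            raise ValueError("files over the inline limit need a gs:// URI")
        image_part = Part.from_uri(gcs_uri, mime_type=mime_type)

    response = GenerativeModel(model_name).generate_content([image_part, PROMPT])
    return response.text
```

Keeping the routing decision in its own small function makes the threshold easy to adjust if the service limits differ from the figure used here.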

Let’s see Gemini-Flash and Gemini-Pro-Vision in action:

Fruit Sugar Content: A simple bar graph comparing sugar levels in different fruits is transformed into a nutritional analysis, highlighting the healthiest choices and potential risks of overconsumption.

Left Image Source: https://www.ranker.com/

TV Show Characters: An image of the ‘Dunder Mifflin’ crew isn’t just recognized; the models analyze facial expressions and body language to infer relationships, moods, and potential storylines.

Battery Life: A bar graph showing battery drain over time becomes a detailed report on device usage patterns, potentially leading to recommendations for optimizing battery life.

Air in Chip Bags: A chart depicting the percentage of air in chip bags is no longer just a meme; it’s a starting point for a discussion on packaging efficiency and environmental impact.

Sports Graphs (Olympics & Cricket): Whether it’s a run rate chart or points table in cricket or a medal tally for the Olympics, these models can uncover trends, predict outcomes, and highlight individual or team performances. Notably, the LLM can correlate what it already knows with what it sees in the following images:

Left Image Source: https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities

Left Image Source: https://www.jagranjosh.com/

The Future is Visual

The potential applications are endless. Imagine:

Medical Imaging: Analyzing X-rays or scans to assist doctors in diagnosis.
Financial Reports: Uncovering hidden insights in complex charts and graphs.
Social Media: Understanding the impact of visuals on online engagement.
Education: Making complex concepts accessible through interactive visual explanations.

The Takeaway

Gemini-Flash and Gemini-Pro-Vision are not just tools; they’re our partners in understanding the visual world. By bridging the gap between raw data and meaningful insights, they’re making us smarter, more informed, and better equipped to navigate the information age.

Let’s keep exploring this graphical revolution together!

Are there any specific types of advanced visuals you’d like to see these models tackle? For example, can an LLM predict the next best move in a game of chess? Is an LLM intelligent enough to predict an LBW, caught-out, or run-out decision from cricket visuals or video? Can an LLM detect fouls and goals in other sports from related visuals with no ‘human-in-the-loop’ requirement?

Share your thoughts in the comments below!

In upcoming blogs, we’ll go through advanced image-related use cases. We’ll also analyze audio, video, PDFs, and other forms of data (multimodality).

References

Develop your ability to create prompts that accurately extract data from images for more complex use cases.

Design multimodal prompts | Generative AI on Vertex AI | Google Cloud (cloud.google.com)

21 Interesting Charts That Made Us Laugh And Think About Things Differently (www.ranker.com)

Document link with all the Gemini-Flash and Pro results along with comparisons (please request access!): https://docs.google.com/document/d/1isrw4zkETAVjQRubuX0dQsi3YQZWW-51QfwD6P9nq8g/edit?tab=t.0

Colab link with all the detailed code cells for using Gemini-Flash and Pro (please request access!): Google Colab (colab.sandbox.google.com)

Note: Should you have any concerns or queries about this post or my implementation, please feel free to connect with me on LinkedIn! Thanks!