Image Understanding, Processing, Classification and Batch Analysis with Gemini Pro & Pro Vision

Christine Tee
12 min read · Mar 12, 2024


This article includes demos from my recent presentations.

Google Gemini

Google Gemini is a family of multimodal large language models developed by Google DeepMind. It can handle multiple modalities of data (text, image, video, audio, etc.) and is positioned as a contender to OpenAI’s GPT-4.

The Vertex AI Gemini API provides a unified interface for interacting with Gemini models. There are currently two models available in the Gemini API:

  • Gemini Pro model (gemini-pro): Designed to handle natural language tasks, multiturn text and code chat, and code generation.
  • Gemini Pro Vision model (gemini-pro-vision): Supports multimodal prompts. You can include text, images, and video in your prompt requests and get text or code responses.

We can explore the Gemini Pro and Gemini Pro Vision models via the MakerSuite UI within Google AI Studio or via the Python Software Development Kit (SDK). Google AI Studio is a browser-based IDE for prototyping with generative models.

Image Understanding with Gemini Pro Vision via Google AI Studio

To access Google AI Studio, go to https://aistudio.google.com/

Now we create a freeform prompt to understand the images, as shown below:

Describe these two images in detail, explain what is happening also mention the book and the author

The sample images are as below:

These two images are from the comic Batman: The Killing Joke, written by Alan Moore and illustrated by Brian Bolland. Feel free to use your own preferred images to explore.

Upon completing the freeform prompt, click “Run” and you will see the output.

The first image is a comic book page from the Batman: The Killing Joke graphic novel. The Joker has taken over an amusement park and is trying to prove to Batman that anyone can be driven insane. He has kidnapped Commissioner Gordon and his daughter Barbara, and is torturing them in an attempt to break Batman's spirit.

The second image is also from the Batman: The Killing Joke graphic novel. Batman has tracked the Joker to the amusement park and is confronting him. The Joker is trying to convince Batman that he is just as crazy as he is, and that there is no difference between them.

The book is written by Alan Moore and illustrated by Brian Bolland. It is a dark and psychological story that explores the themes of madness and violence.

From the output, we can see the initial information about the book and author is accurate. The rest of the output describes what the model understands from the image.

Image Understanding with Gemini Pro Vision via Python SDK

The Colab notebook for this session can be found here: https://github.com/pytee/geminiintro/blob/main/Demo2GeminiProVision.ipynb

Before running the Colab notebook, an API key has to be obtained. If you don’t already have one, you can create a key with one click in Google AI Studio (see the first section, where we used Google AI Studio).

Get API Key from the button on the top left

Back in the Colab notebook, add the key to the Secrets manager under the “🔑” icon in the left panel and give it the name GEMINI_API_KEY.

Secrets Manager on Colab

Code Walkthrough

!pip install -q -U google-generativeai

This line installs the google-generativeai library.

import pathlib
import textwrap
import google.generativeai as genai
from IPython.display import display
from IPython.display import Markdown
import PIL.Image
import urllib.request
from PIL import Image

This section imports the packages for handling images and displaying outputs.

# Used to securely store your API key
from google.colab import userdata
# Or use `os.getenv('GOOGLE_API_KEY')` to fetch an environment variable.
GOOGLE_API_KEY=userdata.get("GEMINI_API_KEY")
genai.configure(api_key=GOOGLE_API_KEY)

This section retrieves the API key from Colab’s Secrets manager and configures the genai client with it.

for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        print(m.name)

This snippet lists and prints the models in the google-generativeai library that support the generateContent method.

# Opening the image for Image Understanding
urllib.request.urlretrieve(
    'https://i.postimg.cc/x1XnKCvV/RxsIzEy.png',
    "comic.png")
image = PIL.Image.open('comic.png')
image

urllib downloads an image from the specified URL and PIL opens it. The image is then displayed in the Colab notebook. Feel free to replace the URL with your own image link.

model = genai.GenerativeModel("gemini-pro-vision")

def to_markdown(text):
    text = text.replace("•", " *")
    return Markdown(textwrap.indent(text, "> ", predicate=lambda _: True))

response = model.generate_content(image)
to_markdown(response.text)

This initializes the “gemini-pro-vision” generative model from the google-generativeai library. The to_markdown helper formats the generated text as Markdown. The model then generates content based on the loaded image alone, without a text prompt.

response = model.generate_content(
    ["Write an explanation based on the image, give the name of the author and the book that it is from", image],
    stream=True
)
response.resolve()
to_markdown(response.text)

This generates content from a specific text prompt together with the image.

  • The stream=True parameter requests a streamed response.
  • response.resolve() waits for the streaming response to complete.
  • The result is formatted as Markdown.
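As an alternative to response.resolve(), the streamed response can also be consumed chunk by chunk as it arrives. A minimal sketch, reusing the model and image objects from above with the same prompt:

response = model.generate_content(
    ["Write an explanation based on the image, give the name of the author and the book that it is from", image],
    stream=True
)

# Print each chunk of text as it arrives instead of waiting for the full response
for chunk in response:
    print(chunk.text)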

Image Processing with Gemini Pro via Python SDK

This session demonstrates Gemini Pro’s code-generation capability for image processing. The Colab notebook for this session can be found here: https://github.com/pytee/geminiintro/blob/main/Demo3ImageProcessingGeminiPro.ipynb

Remark: The code in this Colab notebook is similar to the previous notebook, so a code walkthrough is omitted. The major difference is that the model used for this task is Gemini Pro instead of Gemini Pro Vision.

Remark 2: The code generated by the model.generate_content function will be different every time you run it, and it may produce code that requires troubleshooting. I ran it several times to get code that runs without any errors.
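For reference, the code below was obtained with a prompt along these lines; the exact prompt wording here is my own approximation rather than the notebook’s verbatim prompt:

model = genai.GenerativeModel("gemini-pro")

# Hypothetical prompt wording; adjust to taste
prompt = (
    "Write Python code using OpenCV and Matplotlib that loads an image, "
    "converts it to grayscale, applies Gaussian blur, detects edges, thresholds the image, "
    "finds and draws contours, and displays all results. The code should run in Google Colab."
)

response = model.generate_content(prompt)
print(response.text)

The following is the image processing code Gemini Pro generated: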

# Import the necessary libraries
import cv2
import numpy as np
import matplotlib.pyplot as plt

# Load an image
image = cv2.imread('image.jpg')

# Convert the image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Apply Gaussian blur to the image
blurred_image = cv2.GaussianBlur(gray_image, (5, 5), 0)

# Apply Canny edge detection to the image
edges_image = cv2.Canny(blurred_image, 100, 200)

# Threshold the image
thresh_image = cv2.threshold(blurred_image, 127, 255, cv2.THRESH_BINARY)[1]

# Find contours in the image
contours, hierarchy = cv2.findContours(thresh_image, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw the contours on the image
contour_image = image.copy()
cv2.drawContours(contour_image, contours, -1, (0, 255, 0), 2)

# Show the images
plt.subplot(151), plt.imshow(image), plt.title('Original Image')
plt.subplot(152), plt.imshow(gray_image), plt.title('Grayscale Image')
plt.subplot(153), plt.imshow(blurred_image), plt.title('Blurred Image')
plt.subplot(154), plt.imshow(edges_image), plt.title('Edges Image')
plt.subplot(155), plt.imshow(thresh_image), plt.title('Thresholded Image')

plt.show()

Since the previous Colab for Image Understanding already has an image downloaded, you can copy and paste this code into that notebook to run it. Change ‘image.jpg’ to ‘comic.png’.
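One caveat worth noting: cv2.imread loads images in BGR channel order, so the colors of the original image may look off when rendered with Matplotlib. A small adjustment on my part (not part of the generated code) fixes the display:

# Convert from OpenCV's BGR channel order to RGB so Matplotlib shows the correct colors
rgb_image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
plt.imshow(rgb_image)
plt.title('Original Image (RGB)')
plt.show()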

The code Gemini Pro generated shows good coding practices, familiarity with various image processing techniques, and compatibility with the Google Colab environment.

Image Processed with the code generated by Gemini Pro

Image Classification with Gemini Pro via Python SDK

In this section we will generate PyTorch code for image classification with Gemini Pro. The Colab notebook for this session can be found here: https://github.com/pytee/geminiintro/blob/main/Demo4ImageClassification.ipynb

Remark: The code of the Colab notebook is similar to the previous notebook, thus a code walkthrough is omitted.

The script uses model.generate_content to create code from a prompt asking for multiclass classification code in the PyTorch framework, using a public dataset and intended to run in Google Colab. In this demo we ask it to use CIFAR-10, a dataset of 60,000 32x32 colour images in 10 classes (6,000 images per class), split into 50,000 training images and 10,000 test images.

Remark 2: The code generated by the model.generate_content function will be different every time you run it, and it may produce code that requires troubleshooting. I ran it several times to get code that runs without any errors.
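For reference, a minimal sketch of the kind of prompt call used; the exact prompt wording is my own approximation:

model = genai.GenerativeModel("gemini-pro")

# Hypothetical prompt wording for the CIFAR-10 classification task
prompt = (
    "Write PyTorch code for multiclass image classification on the CIFAR-10 dataset. "
    "Download the dataset, train a model, report the test accuracy, "
    "and make the code runnable in Google Colab with GPU support."
)

response = model.generate_content(prompt)
print(response.text)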

Below is the code generated by Gemini Pro.

import torch
import torchvision
import torch.nn as nn
import torch.optim as optim
from torchvision import transforms, datasets
from torch.utils.data import DataLoader

# Define the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Define the training and testing transforms
train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

test_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2023, 0.1994, 0.2010)),
])

# Download the CIFAR-10 dataset
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=train_transform)
test_dataset = datasets.CIFAR10(root='./data', train=False, download=True, transform=test_transform)

# Create the data loaders
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)

# Define the model
model = torchvision.models.resnet18(pretrained=False)
model.fc = nn.Linear(512, 10)
model.to(device)

# Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 10
for epoch in range(num_epochs):
    for i, (images, labels) in enumerate(train_loader):
        images = images.to(device)
        labels = labels.to(device)

        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Backward pass
        optimizer.zero_grad()
        loss.backward()

        # Update the weights
        optimizer.step()

        if (i + 1) % 100 == 0:
            print(f'Epoch [{epoch + 1}/{num_epochs}], Step [{i + 1}/{len(train_loader)}], Loss: {loss.item()}')

# Test the model
correct = 0
total = 0
with torch.no_grad():
    for images, labels in test_loader:
        images = images.to(device)
        labels = labels.to(device)

        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f'Accuracy of the model on the test images: {100 * correct / total}%')

Inspecting the code, we can see that it is valid PyTorch and well structured.

  • RandomCrop: This layer randomly crops an image to a smaller size (32x32 pixels in this case) during training. This helps the model learn features that are not specific to the entire image but are present in different parts.
  • RandomHorizontalFlip: This layer randomly flips the image horizontally with a 50% chance during training. This helps the model learn features that are independent of the object’s orientation.
  • ToTensor: Converts the image data from a NumPy array to a PyTorch tensor and normalizes the pixel values between 0 and 1.
  • Normalize: Subtracts the mean and divides by the standard deviation for each color channel (RGB) based on the provided statistics. This helps improve training stability.
  • Convolutional Layers: These are the core layers of ResNet18. They extract features from the image through convolutions and apply non-linear activation functions (likely ReLU) to introduce non-linearity. ResNet18 has several convolutional layers stacked together in multiple blocks.

The generated code also replaces ResNet18’s final fully connected layer.

  • Original fc layer (replaced): ResNet18 typically has a fully-connected layer at the end with 1000 outputs, corresponding to the 1000 classes in the ImageNet dataset it’s pre-trained on.
  • New fc layer: The code replaces this layer with a new fully-connected layer with 10 outputs, matching the number of classes in the CIFAR-10 dataset (airplane, car, etc.). This final layer takes the high-level features extracted from the convolutional layers and classifies the image into one of the 10 categories.

Note: Since the pretrained argument in torchvision.models.resnet18 is set to False, the model is trained from scratch. This means the weights learned in the pre-trained model are not used, and the model learns new features specifically for the CIFAR-10 dataset.
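As a side note, the run log below includes deprecation warnings because newer torchvision releases replace the pretrained argument with weights. An equivalent, warning-free call would look like this (a minor adjustment on my part, not part of the generated code):

# Equivalent to pretrained=False in newer torchvision versions
model = torchvision.models.resnet18(weights=None)
model.fc = nn.Linear(512, 10)
model.to(device)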

Downloading https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz to ./data/cifar-10-python.tar.gz
100%|██████████| 170498071/170498071 [00:01<00:00, 96073596.50it/s]
Extracting ./data/cifar-10-python.tar.gz to ./data
Files already downloaded and verified
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/usr/local/lib/python3.10/dist-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=None`.
warnings.warn(msg)
Epoch [1/10], Step [100/782], Loss: 1.867108941078186
Epoch [1/10], Step [200/782], Loss: 1.6108732223510742
Epoch [1/10], Step [300/782], Loss: 1.431161880493164
Epoch [1/10], Step [400/782], Loss: 1.3744747638702393
Epoch [1/10], Step [500/782], Loss: 1.4931761026382446
Epoch [1/10], Step [600/782], Loss: 1.373394250869751
Epoch [1/10], Step [700/782], Loss: 1.3625941276550293
Epoch [2/10], Step [100/782], Loss: 1.3152024745941162
Epoch [2/10], Step [200/782], Loss: 1.3129165172576904
Epoch [2/10], Step [300/782], Loss: 1.2746528387069702
Epoch [2/10], Step [400/782], Loss: 1.2436751127243042
Epoch [2/10], Step [500/782], Loss: 1.2316663265228271
Epoch [2/10], Step [600/782], Loss: 1.2487694025039673
Epoch [2/10], Step [700/782], Loss: 0.9773951768875122
Epoch [3/10], Step [100/782], Loss: 0.9363090991973877
Epoch [3/10], Step [200/782], Loss: 1.0028033256530762
Epoch [3/10], Step [300/782], Loss: 0.778881847858429
Epoch [3/10], Step [400/782], Loss: 0.8583905100822449
Epoch [3/10], Step [500/782], Loss: 1.0823025703430176
Epoch [3/10], Step [600/782], Loss: 0.7362897992134094
Epoch [3/10], Step [700/782], Loss: 0.8144421577453613
Epoch [4/10], Step [100/782], Loss: 0.9890285730361938
Epoch [4/10], Step [200/782], Loss: 0.9285922646522522
Epoch [4/10], Step [300/782], Loss: 0.8415253758430481
Epoch [4/10], Step [400/782], Loss: 0.8325414657592773
Epoch [4/10], Step [500/782], Loss: 0.9986714720726013
Epoch [4/10], Step [600/782], Loss: 0.6631445288658142
Epoch [4/10], Step [700/782], Loss: 1.0077797174453735
Epoch [5/10], Step [100/782], Loss: 0.726042628288269
Epoch [5/10], Step [200/782], Loss: 1.0857372283935547
Epoch [5/10], Step [300/782], Loss: 0.9091933369636536
Epoch [5/10], Step [400/782], Loss: 0.8824754357337952
Epoch [5/10], Step [500/782], Loss: 0.7253855466842651
Epoch [5/10], Step [600/782], Loss: 0.76465904712677
Epoch [5/10], Step [700/782], Loss: 0.9583697319030762
Epoch [6/10], Step [100/782], Loss: 0.7375239729881287
Epoch [6/10], Step [200/782], Loss: 0.805182933807373
Epoch [6/10], Step [300/782], Loss: 0.8819254040718079
Epoch [6/10], Step [400/782], Loss: 0.7021761536598206
Epoch [6/10], Step [500/782], Loss: 0.67523592710495
Epoch [6/10], Step [600/782], Loss: 0.9843491911888123
Epoch [6/10], Step [700/782], Loss: 0.6799099445343018
Epoch [7/10], Step [100/782], Loss: 0.6445555090904236
Epoch [7/10], Step [200/782], Loss: 0.6973190903663635
Epoch [7/10], Step [300/782], Loss: 0.6598818302154541
Epoch [7/10], Step [400/782], Loss: 0.6572363376617432
Epoch [7/10], Step [500/782], Loss: 0.7315282821655273
Epoch [7/10], Step [600/782], Loss: 0.8872025012969971
Epoch [7/10], Step [700/782], Loss: 0.5378426313400269
Epoch [8/10], Step [100/782], Loss: 0.9714867472648621
Epoch [8/10], Step [200/782], Loss: 0.7257418036460876
Epoch [8/10], Step [300/782], Loss: 0.6902379989624023
Epoch [8/10], Step [400/782], Loss: 0.6892907619476318
Epoch [8/10], Step [500/782], Loss: 0.8644915819168091
Epoch [8/10], Step [600/782], Loss: 0.586513340473175
Epoch [8/10], Step [700/782], Loss: 0.8298776745796204
Epoch [9/10], Step [100/782], Loss: 0.7072436213493347
Epoch [9/10], Step [200/782], Loss: 0.811255156993866
Epoch [9/10], Step [300/782], Loss: 0.6209952235221863
Epoch [9/10], Step [400/782], Loss: 0.6035144329071045
Epoch [9/10], Step [500/782], Loss: 0.5119025707244873
Epoch [9/10], Step [600/782], Loss: 0.606617271900177
Epoch [9/10], Step [700/782], Loss: 0.8304533958435059
Epoch [10/10], Step [100/782], Loss: 0.5963568687438965
Epoch [10/10], Step [200/782], Loss: 0.6931501626968384
Epoch [10/10], Step [300/782], Loss: 0.8494506478309631
Epoch [10/10], Step [400/782], Loss: 0.5403224229812622
Epoch [10/10], Step [500/782], Loss: 0.9768345355987549
Epoch [10/10], Step [600/782], Loss: 0.6757402420043945
Epoch [10/10], Step [700/782], Loss: 0.3573817312717438
Accuracy of the model on the test images: 76.94%

Running the generated code yields a test accuracy of 76.94%. It isn’t the best result, but the training loss decreases gradually, showing that the model is improving as it learns. The output “Files already downloaded and verified” also confirms that the dataset was set up correctly.

Image Batch Analysis with Gemini Pro Vision & BigQuery

In this section, we will run the demo obtained from this repository from Google Cloud Platform. The instructions in this section also come from the aforementioned GitHub repository. With Gemini Pro Vision, we will analyze multiple landscape photos.

First, we need to create a Google Cloud Platform project and enable billing.

Enable the Cloud Resource Manager API in your project so that Terraform can deploy resources.

Open Cloud Shell and clone the GitHub repository into your project:

gcloud config set project <PROJECT ID>
git clone https://github.com/GoogleCloudPlatform/generative-ai/
cd ./generative-ai/gemini/use-cases/applying-llms-to-data/using-gemini-with-bigquery-remote-functions

A Terraform module is provided to bring all of the deployment steps together into one deployable package. Deploy the Terraform code by running terraform init, terraform plan, and terraform apply as three consecutive commands.

terraform init
terraform plan
terraform apply

The final two commands, terraform plan and terraform apply, will request that you provide your project ID and region. This sample has been tested using region us-central1.

Once the terraform apply command has successfully completed, head to the BigQuery console where you’ll see a few resources within the Explorer pane:

The resources on BigQuery Explorer after deployment

  • gemini_bq_demo_image: BigQuery Remote Function that prompts the Gemini Pro Vision model.
  • image_remote_function_sp: A stored procedure containing an SQL query that calls the BigQuery Remote Function to analyze the BigQuery Object Table.
  • image_object_table: BigQuery Object Table that has landmark images.

The Remote Function in SQL.

The following query analyzes the images by passing the image URIs (available in the object table) into the gemini_bq_demo_image remote function, which concatenates each image with the text prompt (“Describe and summarize this image. Use no more than 5 sentences to do so”) and makes the call to Gemini.
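If you prefer to trigger the analysis programmatically instead of from the BigQuery console, calling the deployed stored procedure might look roughly like the sketch below. Note that the dataset name gemini_demo is a placeholder assumption; check the Explorer pane for the dataset actually created by Terraform:

from google.cloud import bigquery

client = bigquery.Client(project="your-project-id")

# Hypothetical dataset name; replace it with the dataset created by the Terraform deployment
query = "CALL `your-project-id.gemini_demo.image_remote_function_sp`();"

for row in client.query(query).result():
    print(row)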

Results returned from Gemini Pro Vision for the image analysis

We can save the results to a BigQuery table or export them to Sheets, so that we don’t incur further charges to get the same results again. Feel free to use your own images for different outputs.
