A Piscean’s take on Gemini

Abirami Sukumaran
Google Cloud - Community
5 min read · Dec 16, 2023

Building a quick multimodal recommendation app with Gemini

Imagen generated image of Yellow Taj Mahal

The image has no significance to the application, except that it is an AI-generated image from Vertex AI Imagen. But being a Piscean, I wanted the start to be magical!

Before I begin: the opinions, ideas and examples in this blog are my own, derived from my interest and experience; they do not represent my employer, any other company, or anyone else. Also, I am NOT soliciting support for any specific sports team; I just used two team names at random, with no preference whatsoever.

What are we building?

We are a team of data scientists at an apparel store, tasked with recommending the right clothing to pair with a given piece of apparel. We won't set up a whole datastore here, since this is just a proof of concept.

How are we building?

Gemini Pro Vision generative model!!! It is Google's multimodal generative AI model that accepts text, images and videos as input and generates text in response. We will use Google Cloud's Vertex AI Gemini API, which provides a unified interface for interacting with Gemini models, and we will invoke it through the Vertex AI SDK for Python.
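To give a sense of the call shape before we dive in, here is a minimal sketch of a multimodal request through the SDK (the project ID and image path are placeholders; the full step-by-step setup follows below):

#Minimal sketch of a multimodal call via the Vertex AI SDK for Python
#(placeholders: your-project-id, photo.jpg; full setup steps follow below)
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Image

vertexai.init(project="your-project-id", location="us-central1")

model = GenerativeModel("gemini-pro-vision")
response = model.generate_content(
    [Image.load_from_file("photo.jpg"), "Describe this image."]
)
print(response.text)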

Why Gemini?

(based on my opinion & personal experience)

First of all, great question! With so many options, why this one? Gemini Pro Vision stunned me with its responses. I have to talk about one particular response, where I used a multimodal input consisting of a text prompt along with an AI-generated image (the image was deliberately blurred so that no team names or numbers were clearly visible). I purposely misled the model with inaccurate information about the image in the text prompt. But the model responded that the detail I provided wasn't correct, and it figured out the right detail from that blurred image. I have highlighted this in the hands-on part of this blog.

So let’s go?

Install the Vertex AI SDK for Python as shown in the doc and authenticate your account.

#Install the Vertex AI SDK for Python
!pip install --upgrade google-cloud-aiplatform

#Authenticate your account (Colab)
from google.colab import auth
auth.authenticate_user()

#Restart the kernel so the newly installed packages are picked up
import IPython
app = IPython.Application.instance()
app.kernel.do_shutdown(True)
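The google.colab auth step above assumes you are running in Colab. If you're running elsewhere, one alternative (assuming you have the gcloud CLI installed) is to set up Application Default Credentials instead:

#Alternative outside Colab: Application Default Credentials via the gcloud CLI
!gcloud auth application-default login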

Set your PROJECT_ID and REGION variables, then initialize Vertex AI:

#Set PROJECT_ID and REGION variables and initialize Vertex AI
import vertexai

PROJECT_ID = "bright-proxy-***" # your project id
REGION = "us-central1"

vertexai.init(project=PROJECT_ID, location=REGION)

Import the model dependencies; in this case, the Vertex AI generative model gemini-pro-vision:

#Load the Gemini Pro Vision multimodal model
from vertexai.preview.generative_models import GenerativeModel, Image

vision_model = GenerativeModel("gemini-pro-vision")

Load the images that we'll use in the multimodal prompt input (I generated these images using an older version of Imagen).

#Download the apparel image and load it for the multimodal prompt
!curl -o image.jpg --silent --show-error https://storage.googleapis.com/img_public_test/data%20files/Apparel/RCB.JPG
image = Image.load_from_file("image.jpg")
image # displays the image inline in the notebook
Image result from the previous step

This is where I was mind-blown! I entered the text below as the prompt, along with the image above:

response = vision_model.generate_content([image, "Royal Challengers Bangalore T shirt"])
print(response.text)

Check the screenshot below for the prompt and response! It says "This is not a Royal Challengers Bangalore jersey. This is a Kolkata Knight Riders jersey". 😳

…and that too from a purposely unclear image.

Alright now! Getting back to business!

Let’s ask the model to pick a matching pair of shorts from a couple of images, shall we?

#Download the two candidate shorts images
!curl -o image1.jpg --silent --show-error https://storage.googleapis.com/img_public_test/data%20files/Apparel/orange_shorts1.JPG
image1 = Image.load_from_file("image1.jpg")

!curl -o image2.jpg --silent --show-error https://storage.googleapis.com/img_public_test/data%20files/Apparel/white_shorts1.JPG
image2 = Image.load_from_file("image2.jpg")

These images are also AI-generated:

Images of shorts pasted side-by-side for easy reference

Prompt for the recommendation:

#Generate a description for the image in a structured format (color, size
#and category), then ask for the most suitable match with justification

prompt = [
    "Describe the following image in terms of color, size - whether it is small, medium or large, type of apparel - whether shirt or trousers or half pants etc. and what category of people it suits - whether kids or adult men or women? Do not include any brand names.",
    image,
    "Also for the generated description, can you select the most suitable matching apparel from the following images",
    image1,
    image2,
    "and provide the justification for it",
]

response = vision_model.generate_content(prompt)
print(response.text)

Here is the response as seen in the screenshot below:

“The image shows a man wearing a half-sleeved t-shirt. The t-shirt is orange, yellow and purple in color. It has a collar and buttons. The t-shirt is medium in size. It is suitable for adult men. The most suitable matching apparel is the orange shorts. The shorts are the same color as the t-shirt and they are also medium in size.”!!! ☺️

Response for the apparel match recommendation demo application

That’s all I had for now!

I liked that the model used the color and size attributes (even though they're relative) as the basis for its recommendation. I also enjoyed how easy it was to get this working: I literally spent only 10 minutes on the code and 10 minutes on this blog, which I guess solidifies the reasoning behind my choice of product. If you want to learn more about this, or how to build an application with it, here you go.
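Before I sign off, here's a rough sketch of how you could fold the steps above into a reusable helper (the function name recommend_match and its prompt wording are my own, not part of the SDK):

#A rough reusable wrapper around the flow above
#(recommend_match is my own helper name, not part of the SDK)
from vertexai.preview.generative_models import GenerativeModel, Image

def recommend_match(anchor_path, candidate_paths):
    """Describe the anchor apparel image and pick the best match among candidates."""
    model = GenerativeModel("gemini-pro-vision")
    parts = [
        "Describe the following image in terms of color, size and type of apparel. Do not include any brand names.",
        Image.load_from_file(anchor_path),
        "Also, select the most suitable matching apparel from the following images",
    ]
    parts += [Image.load_from_file(p) for p in candidate_paths]
    parts.append("and provide the justification for it")
    return model.generate_content(parts).text

#Example usage with the images downloaded earlier
print(recommend_match("image.jpg", ["image1.jpg", "image2.jpg"]))

Ok, bye for now.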


Abirami Sukumaran

Developer Advocate, Google. With 16+ years in data and software dev leadership, I'm passionate about addressing real-world opportunities with technology.