Pixel guessing : using Gemini Pro Vision with Go

Val Deleplace
Google Cloud - Community
5 min read · Jan 25, 2024

Let’s have fun with the vision powers of AI!

Gemini is Google’s most powerful publicly available family of AI models. The Gemini Pro models are available in Google Cloud via the Vertex AI web console and via the SDK available in Python, Go, Java, and Node.js.

Gemini Pro Vision is multimodal: the prompt can contain text, images, videos, or a mix of all three.

A very popular use case for testing the model is word-guessing: you provide a picture and ask “What does this picture look like?”. The model is pretty good at guessing what you’ve just sketched with a pencil (a freehand drawing), or at guessing the contents and the context of a photo.

(yes, I’m an artist)

I was curious to explore the limits of what the model can “see” when we show it very little information: a low-resolution picture.

What can you see here?
What does the AI model see?

⇒ Try the live demo: Pixel Guessing

As a Go developer, I was eager to try and call the Google Cloud Vertex AI API from my Go code, using the brand new Go SDK. Here is how you can try it too!

Prerequisites

  • Create a Google Cloud project (or select an existing one)
  • Visit Vertex AI API, and click ENABLE
  • Install gcloud, the Google Cloud SDK
  • In your dev machine’s terminal, run the following command:
gcloud auth application-default login

and select the same Google account you’re using in the Google Cloud web console.

Now, here is how I created my little demo.

Text editor, with a bit of magic

I’m using VS Code, and for the first time I tried an AI-powered assistant for code generation and code completion. I installed the Google Cloud Code extension, which integrates the Duet AI assistant, and proceeded with the required setup.

After logging in and selecting my project, I was ready to go!

This turned out to save me a lot of time on Go and JavaScript coding: high-quality code suggestions that take a lot of relevant context into account. It often felt like it was reading my mind. How did Duet AI know I wanted to write these 4 lines next?! But hey, I’ll take ’em. The UI was sometimes wiggly, though (no big deal).

For example, here is a great suggestion (gray text) for a standard Go server boilerplate:

or this nice suggestion in Javascript, to handle an HTTP response:

Cloud architecture

My approach is to deploy a Frontend and a Backend that communicate in JSON with the browser, and have the Backend call the Vertex AI service to query the Gemini Pro Vision multimodal model.

The source code of the app (Frontend and Backend) is available at github.com/Deleplace/pixel-guessing .

Frontend

My Frontend consists of HTML, CSS, and JS that send requests to the backend to:

  • Generate pixelated images (resized to low-resolution)
  • Ask the Gemini Pro Vision model “What do you see in this pixelated picture?”

Backend

My Go backend is hosted on Cloud Run.

In essence, this is how I’m calling the Vertex AI service:

import (
	"context"
	"fmt"

	"cloud.google.com/go/vertexai/genai"
)

var prompt = "What does this picture look like? Provide a short answer in less than 8 words."

func guess(ctx context.Context, jpegData []byte) (genai.Part, error) {
	client, err := genai.NewClient(ctx, "MY-GOOGLE-CLOUD-PROJECT-ID", "us-central1")
	if err != nil {
		return nil, fmt.Errorf("unable to create client: %v", err)
	}
	defer client.Close()

	model := client.GenerativeModel("gemini-pro-vision")
	model.Temperature = 0.4

	img := genai.ImageData("jpeg", jpegData)
	res, err := model.GenerateContent(
		ctx,
		img,
		genai.Text(prompt))
	if err != nil {
		return nil, fmt.Errorf("calling GenerateContent: %v", err)
	}
	answer := res.Candidates[0].Content.Parts[0]
	return answer, nil
}

To degrade an image into a lower resolution, I’m resizing it with the golang.org/x/image/draw package:

import (
	"image"

	"golang.org/x/image/draw"
)

func resizeRatio(src image.Image, ratio float32) image.Image {
	newWidth := int(ratio * float32(src.Bounds().Max.X))
	newHeight := int(ratio * float32(src.Bounds().Max.Y))
	dst := image.NewRGBA(image.Rect(0, 0, newWidth, newHeight))
	draw.NearestNeighbor.Scale(dst, dst.Rect, src, src.Bounds(), draw.Over, nil)
	return dst
}

func resize(src image.Image, newWidth int) image.Image {
	ratio := float32(newWidth) / float32(src.Bounds().Max.X)
	return resizeRatio(src, ratio)
}

Let’s run the server locally:

GOOGLE_CLOUD_PROJECT=my-project-id-here go run .

and deploy to Cloud Run:

gcloud run deploy pixel-guess --source=. \
--region=us-central1 \
--max-instances=1

“us-central1” is one of the regions where Gemini is supported. Choosing the same region for Cloud Run and for Vertex AI helps reduce latency.

Cloud Run services are supposed to be stateless. My app stores some state for a few minutes when you upload a picture, resize it into many lower-resolution images, and ask Vertex AI to guess the contents. Instead of using a real database or a persistent file system, my sample app is just keeping the picture in memory. By setting max-instances=1, I’m ensuring that only one instance of the server is currently holding state in memory, and serving requests. Of course that’s fine for a small demo, not for a large scalable production service.

How good are the results?

Well, don’t take my word for it, go ahead and try it :)

Your turn now

What would you build with a multimodal model that accepts images and videos?

Engineer on cloudy things @Google. Opinions my own. Twitter @val_deleplace