Building a personalized stylist using Gemini & Imagen 2

Amaan Dhamaskar
Google Cloud - Community
6 min read · Aug 4, 2024

Ever dreamed of having a personal stylist? With the power of Gemini and Imagen, that dream is now a reality.

In this tutorial, we’ll guide you through building your own fashion companion that understands your style, offers recommendations, and even visualizes outfit ideas. Along the way, you will also learn about Google’s state-of-the-art Gemini models and Imagen, Google’s image generation model.

System Architecture

Let’s first understand what we’re trying to build and how the pieces fit together.

First, the user uploads their image and provides basic information like the occasion and styling preferences. We upload this image to a Google Cloud Storage bucket, then pass the image, along with the occasion and preferences, to the Gemini 1.5 Flash model, which generates a description of a suitable garment.

We then generate a mask over all the clothing articles in the user’s image. This step is necessary for the image generation model to understand which region of the image needs to be regenerated. Finally, we pass the user’s original image, the mask, and the garment description generated by Gemini to Google’s Imagen 2 model, which generates a new image of the user wearing the garment.
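Before diving in, here is a rough outline of the flow in code. The helper names below are illustrative placeholders, not real functions; each step is implemented concretely in the sections that follow:

# Illustrative outline only: these helpers are placeholders for the steps built below
def style_user(image_path: str, occasion: str) -> str:
    image_url = upload_to_gcs(image_path)                       # 1. upload the photo to a public bucket
    description = describe_garment(image_url, occasion)         # 2. Gemini writes a garment description
    mask_path = segment_clothing(image_path)                    # 3. mask the clothing region
    return inpaint_garment(image_path, mask_path, description)  # 4. Imagen 2 redraws that region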

Prerequisites

This tutorial is suited for beginners, and only basic knowledge of Python is assumed. Familiarity with common Python libraries like Matplotlib and OpenCV helps, but is not necessary.

The entire code can be run and viewed on Google Colab: https://colab.research.google.com/drive/1V9exbyC9CdNeIqxCN-O3xMGhKlo9zpK-?usp=sharing

Let’s Begin

First, let’s install the necessary Python packages: google-cloud-aiplatform for connecting to Vertex AI, cloths_segmentation to create a mask around the user’s clothing, and google-cloud-storage along with some helper packages.

!pip install --upgrade --user google-cloud-aiplatform gitpython magika iglovikov_helper_functions cloths_segmentation google-cloud-storage

The installation might prompt you to restart your runtime; after the restart, all packages will be available. Next, we authenticate our Colab notebook with Google Cloud, without needing API keys, service account keys, or any SDK setup.

import sys

# Authenticate the Colab runtime with your Google account
if "google.colab" in sys.modules:
    from google.colab import auth

    auth.authenticate_user()

We then initialize Vertex AI. For this, you will need your project ID and location. If you don’t know how to create a project, follow the steps in the Google Cloud documentation.

PROJECT_ID = "YOUR_PROJECT_ID"  # @param {type:"string"}
LOCATION = "us-central1" # @param {type:"string"}

import vertexai

vertexai.init(project=PROJECT_ID, location=LOCATION)

Now let’s import everything we need. We’ll see what each component does in the next steps of the tutorial.

from vertexai.generative_models import GenerativeModel, Image, Part
from pylab import imshow
import numpy as np
import cv2
import torch
import albumentations as albu
from iglovikov_helper_functions.utils.image_utils import load_rgb, pad, unpad
from iglovikov_helper_functions.dl.pytorch.utils import tensor_from_rgb_image
from cloths_segmentation.pre_trained_models import create_model
import http.client
import typing
import urllib.request
import IPython.display
from IPython.core.interactiveshell import InteractiveShell

We then define a couple of helper functions to load our image from a public Cloud Storage bucket URL. (These steps are optional, since the image could be fed directly to Gemini, but loading from a bucket makes the code reusable in deployable applications.)

You will need to create a Cloud Storage bucket and grant public read access to allUsers.
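If you haven’t created the bucket yet, one way to do it from the notebook is with the gcloud CLI. The bucket name below is a placeholder, and note that making a bucket public exposes its contents to everyone:

# Create the bucket, then grant public read access to allUsers
!gcloud storage buckets create gs://YOUR_BUCKET_NAME --location=us-central1
!gcloud storage buckets add-iam-policy-binding gs://YOUR_BUCKET_NAME --member=allUsers --role=roles/storage.objectViewer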

def get_image_bytes_from_url(image_url: str) -> bytes:
    # Fetch the raw image bytes from a public URL
    with urllib.request.urlopen(image_url) as response:
        response = typing.cast(http.client.HTTPResponse, response)
        image_bytes = response.read()
    return image_bytes


def load_image_from_url(image_url: str) -> Image:
    # Wrap the bytes in a Vertex AI Image object for Gemini
    image_bytes = get_image_bytes_from_url(image_url)
    return Image.from_bytes(image_bytes)

We then load our image and upload it to our newly created bucket.

image = load_rgb("image.jpg")
!gcloud storage cp image.jpg gs://YOUR_BUCKET_NAME

For this tutorial, we will use the following image:

For this use case, we will use Gemini 1.5 Flash, Google’s latest multimodal language model. It has a large context window of up to 1M tokens and offers low latency thanks to its lightweight architecture. You could also use the Gemini 1.5 Pro model, which offers higher-quality results, but we prioritized latency for our use case.

We use Gemini to generate a description of garments suited to the user’s body shape, skin tone, and hair color. We can also prompt the user to enter the occasion they would like to dress for, their dressing preferences, and so on. We initialize the model and then pass our image to it with the following prompt:

Based on the given image, I want you to give me a description of garments/clothes that would suit the person for the occasion of an {occasion}. Your response should include the garment description, nothing else.

BUCKET_NAME = "YOUR_BUCKET_NAME"

newimage = load_image_from_url(f"https://storage.googleapis.com/{BUCKET_NAME}/image.jpg")

occasion = "indian wedding"
prompt = f"Based on the given image, I want you to give me a description of garments/clothes that would suit the person for the occasion of an {occasion}. Your response should include the garment description, nothing else."

multimodal_model = GenerativeModel("gemini-1.5-flash-001")
contents = [prompt, newimage]
responses = multimodal_model.generate_content(contents)
print(responses.text)

Here is the response generated by Gemini:

A silk or velvet sherwani in a rich color like maroon, emerald green, or navy blue, paired with matching churidar pants and a turban or safa in a contrasting color. The sherwani could have intricate embroidery or embellishments, and the turban could be decorated with a jewel or a feather. For footwear, a pair of embellished mojaris or jootis would be perfect. A shawl in silk or Pashmina in a contrasting color can be worn over the sherwani for a touch of elegance.
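The prompt above only uses the occasion, but as mentioned earlier, you can also collect the user’s styling preferences. A minimal sketch of that variation follows; the preferences string is an assumed user input, not part of the original code:

# Hypothetical extension: fold user-supplied styling preferences into the prompt
preferences = "prefers pastel colors and minimal embroidery"  # assumed user input
prompt = (
    f"Based on the given image, I want you to give me a description of garments/clothes "
    f"that would suit the person for the occasion of an {occasion}. "
    f"Keep these preferences in mind: {preferences}. "
    f"Your response should include the garment description, nothing else."
)
responses = multimodal_model.generate_content([prompt, newimage])
print(responses.text)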

The next step is to create a mask of the user’s clothes. The mask tells the image generation model which sub-region of the image to regenerate. For this, we initialize the Unet_2020-10-30 model from the cloths_segmentation package.

model = create_model("Unet_2020-10-30")
model.eval();

In the following code, we transform the image as per the model’s requirements, generate the mask, and save it to disk for the inpainting step.

transform = albu.Compose([albu.Normalize(p=1)], p=1)
padded_image, pads = pad(image, factor=32, border=cv2.BORDER_CONSTANT)
x = transform(image=padded_image)["image"]
x = torch.unsqueeze(tensor_from_rgb_image(x), 0)
with torch.no_grad():
    prediction = model(x)[0][0]
mask = (prediction > 0).cpu().numpy().astype(np.uint8)
mask = unpad(mask, pads)
cv2.imwrite("mask.jpg", mask * 255)  # save as a black-and-white mask for the inpainting step
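To sanity-check the segmentation before moving on, we can display the mask using the imshow helper we imported from pylab earlier:

# Display the binary clothing mask (clothing pixels are 1, background is 0)
imshow(mask)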

Generated Mask:

Now that we have the user image, the mask, and the garment description, we can generate the final image. For this, we will use Imagen 2, Google’s text-to-image generation model. It is a diffusion model that takes a prompt as input and generates images based on that prompt.

For our use case, we will use inpainting: a process that selectively regenerates only a specific region of the image, based on the text prompt provided to the model. The mask specifies the region to be regenerated.

In the following code, we pass the user’s image, the mask, and the text prompt (the garment description) to Imagen 2 and set the edit mode to inpainting-insert. We can optionally choose the number of variations to generate. Finally, we store the model’s output under the specified file name.

from vertexai.preview.vision_models import Image, ImageGenerationModel

project_id = "YOUR_PROJECT_ID"
input_file = "image.jpg"
mask_file = "mask.jpg"
output_file = "output.png"
prompt = f"Generate Clothing based on the following description. Leave Spectacles untouched. Description{responses.text}" # The text prompt describing what you want to see inserted.

vertexai.init(project=project_id, location="us-central1")

model = ImageGenerationModel.from_pretrained("imagegeneration@006")
base_img = Image.load_from_file(location=input_file)
mask_img = Image.load_from_file(location=mask_file)

images = model.edit_image(
    base_image=base_img,
    mask=mask_img,
    prompt=prompt,
    edit_mode="inpainting-insert",
)

images[0].save(location=output_file, include_generation_parameters=False)
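To preview the result inline in the notebook, we can use the IPython.display module we imported earlier (fully qualified here, since the Image name was rebound by the vision_models import above):

# Render the generated image inline in Colab
IPython.display.display(IPython.display.Image(filename=output_file))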

Results

Here is the image generated by Imagen:

Next Steps

To learn more about Google Cloud services and to create impact with the work you do, explore the Vertex AI documentation and try extending this project with your own ideas.
