Text from Images with Google’s Gemini Pro Vision.

Development of a web app that generates poems from images, leveraging Google’s multimodal LLM Gemini Pro Vision and Streamlit.

Published in

Write A Catalyst

5 min read · Feb 5, 2024

(Disclaimer: This page is solely meant to show the capability of Google’s multimodal LLM, Gemini Pro Vision. AI-generated content is not suitable for publication.)

Assuming that readers are already familiar with Google’s state-of-the-art multimodal LLM Gemini and its variants, this page demonstrates how to build a simple web app with the Streamlit framework and the Gemini Pro Vision model in a few lines of code.

It is recommended to create a separate environment to avoid conflicts among libraries.

Step-1: Get your Gemini API key

Visit https://makersuite.google.com/app/prompts/new_freeform. Find and click the Get API key button in the top left corner. It will navigate to a new page, as in the screenshot below.

Click on Create API key in new project and generate the key. (Do not expose the API key. It may expire.)

Step-2: Create a .env file

Create a .env file and paste the generated API key into it as GOOGLE_API_KEY='generated api key'. (The API key can also be passed directly in the code.)

Step-3: Create requirements.txt file

The following libraries need to be listed in requirements.txt:

# requirements.txt
streamlit
python-dotenv
google-generativeai

Step-4: Install requirements

Open a terminal and run pip install -r requirements.txt. If you are working in Jupyter Notebook or Google Colab, you can run the command in a cell as !pip install -r requirements.txt.
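
After installation, a quick check that the packages can actually be imported may save some debugging later. The following is a minimal sketch using only the standard library. The helper name find_missing_modules is an assumption made for illustration, and the module names are the import names used in Step-5, which differ from the pip package names for python-dotenv and google-generativeai.

```python
import importlib.util


def find_missing_modules(module_names):
    """Return the names from module_names that cannot be located for import.

    Hypothetical helper, not part of any of the libraries used here.
    """
    missing = []
    for name in module_names:
        try:
            spec = importlib.util.find_spec(name)
        except (ImportError, ValueError):
            # Raised, for example, when a parent package such as 'google' is absent
            spec = None
        if spec is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Import names used by the app (taken from the import statements in Step-5)
    required = ["streamlit", "dotenv", "google.generativeai", "PIL"]
    missing = find_missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If the script reports missing modules, re-running the pip command inside the active environment is usually the first thing to try.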

Step-5: Prepare script

Import libraries:

# import the genai module to access the models
import google.generativeai as genai

# import load_dotenv to load environment variables from the .env file
from dotenv import load_dotenv

# load the environment variables (GOOGLE_API_KEY)
load_dotenv()

import streamlit as st

# import Image to open the uploaded image for content generation
from PIL import Image
import os

Step-6: Configure the generative model

genai.configure(api_key = os.getenv("GOOGLE_API_KEY"))
genModel = genai.GenerativeModel('gemini-pro-vision')
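
os.getenv returns None when the variable is not set, so a missing or misnamed .env entry tends to surface only later, when the first request is made. A minimal sketch of failing early is shown below. read_api_key is a hypothetical helper name, and it accepts the environment mapping as a parameter purely so that it can be tried out without touching the real environment.

```python
import os


def read_api_key(env=None, name="GOOGLE_API_KEY"):
    """Return the API key stored under `name`, or raise a clear error.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    # Strip whitespace and any quotes that may have been left around the value
    key = (env.get(name) or "").strip().strip("'\"")
    if not key:
        raise RuntimeError(
            f"{name} is not set. Add it to the .env file or export it first."
        )
    return key
```

The result could then be passed along as genai.configure(api_key=read_api_key()) once load_dotenv() has run.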

Step-7: Define a function to generate and extract content

def geminiResponse(prompt, image):
    response = genModel.generate_content([prompt, image])
    return response.text

prompt is the string passed to the model specifying its task.

Step-8: Set up streamlit page

# title for the web app
st.title('Gemini Poem Generator')

# upload the image
imageUploaded = st.file_uploader("Upload content image...")

# initialise the image variable
image = ''  # don't assign None; it may throw an error

# display the image
if imageUploaded is not None:
    image = Image.open(imageUploaded)
    # use_column_width=True fits the image to the available column width
    st.image(image, caption="Content Image", use_column_width=True)

# add a button to generate the content
generate = st.button("Generate content")

Prepare prompt:

The quality of the model's response depends on how refined the prompt is. With a suitably refined prompt, the model can be instructed to generate any kind of content from the image, such as a description, poem, story or summary.

# write a prompt as per your requirement
prompt = '''You are an experienced, world-class poet.
I will upload an image and, based on the uploaded image,
you have to generate a poem.'''
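
Since details such as length, tone and language can be spelled out in the prompt, it may be convenient to assemble the prompt from a few parameters. The sketch below is only one possible way to do this. build_prompt and its parameters are hypothetical and not part of the Gemini API.

```python
def build_prompt(form="poem", tone=None, max_lines=None, language=None):
    """Assemble an instruction string for the model from optional preferences."""
    parts = [
        "You are an experienced, world-class poet.",
        "I will upload an image and, based on the uploaded image, "
        f"you have to generate a {form}.",
    ]
    # Each preference is appended only when the caller supplies it
    if tone:
        parts.append(f"The tone should be {tone}.")
    if max_lines:
        parts.append(f"Keep it within {max_lines} lines.")
    if language:
        parts.append(f"Write it in {language}.")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_prompt(tone="melancholic", max_lines=8, language="English"))
```

The returned string can be used wherever the fixed prompt variable is used, which makes it easier to compare how different instructions change the generated poem.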

Display content:

Once the image is uploaded and the generate button is clicked, the response is displayed in the web app by the code below.

if generate:
    # get the response from the model
    response = geminiResponse(prompt, image)
    # write the text content on the web page
    st.write(response)

Step-9: Wrap up the code and deploy the app

Full code (omitting comments). The file extension should be .py.

'''app.py'''

import google.generativeai as genai
import streamlit as st
from PIL import Image
import os
from dotenv import load_dotenv
load_dotenv()

genai.configure(api_key = os.getenv("GOOGLE_API_KEY"))
genModel = genai.GenerativeModel('gemini-pro-vision')

def geminiResponse(prompt, image):
    response = genModel.generate_content([prompt, image])
    return response.text

st.title('Gemini Poem Generator')
imageUploaded = st.file_uploader("Upload content image...")
image = '' 

if imageUploaded is not None:
    image = Image.open(imageUploaded)
    st.image(image,caption="Uploaded Content Image",use_column_width=True)
generate = st.button("Generate content")

prompt1 = '''You are a veteran poet. I will upload an image
and, based on the uploaded image, you have to generate a poem in English.'''

prompt2 = '''You are an experienced, world-class poet.
I will upload an image and, based on the uploaded image,
you have to generate a poem.'''

if generate:
    response = geminiResponse(prompt1, image)
    st.write(response)

Deploy the App:

Run app.py and make sure that it is error free. Go to the terminal and run streamlit run app.py. In a conda environment, run the same command in PowerShell from the folder where the file is saved. It will open the Streamlit app in your default browser as shown below.

Step-10: Upload image and generate content

Upload some images to the app and generate content (a poem).

Uploading cover image:

I personally can’t say that it’s a great poem. Maybe my prompt was not that great. Let’s try with another image. (Image by Edward Hopper, Two Comedians (1966). Image courtesy of Sotheby’s.)

The above response looks better. Finally, let’s try a random image found on the internet (Source: https://www.istockphoto.com/):

Summary

I hope this page explains the capability of Google’s multimodal Gemini Pro Vision well. It has done a wonderful job as an image-to-text model. The model’s response can be made more precise with a more carefully written prompt. The length, tone and other preferred features of the content can also be specified in the prompt, which highlights the importance of prompt engineering when dealing with LLMs.

✉️ mkk.rakesh@gmail.com