Image to Task Translator using Multimodal Gemini Vision-Pro

Abhik Saha
7 min read · Feb 17, 2024

Businesses often document their processes as flowcharts, but turning those diagrams into actionable steps is usually manual work that costs time and resources. Recent advances in generative AI offer a way to automate it. One such solution is the Image to Task Translator, built on the multimodal Gemini Pro Vision model available through Google’s Vertex AI.

Introduction to the Solution

The Image to Task Translator uses the Gemini Pro Vision model (gemini-pro-vision) on Google’s Vertex AI platform. The model accepts an image together with a text prompt, so a single request can both read the flowchart and describe it in natural language. The translator uses this to convert flowchart images stored in a Google Cloud Storage bucket into a structured list of tasks, written to an Excel workbook with one sheet per image.

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
from google.cloud import storage
import pandas as pd
from xlsxwriter.workbook import Workbook
from datetime import datetime
import configparser
import json
import logging
import os

def is_empty_or_whitespace(s):
    # Check whether a string is None, empty, or consists only of whitespace
    return s is None or s.isspace() or not s

def generate_text_response(client, gcs_path, context, task, output_format, default_prompt_flag):

    # Initialize Vertex AI (project_name_param and region are module-level settings)
    vertexai.init(project=project_name_param, location=region)

    # Load the model
    multimodal_model = GenerativeModel("gemini-pro-vision")

    # Format the context, task, and output_format
    #context, task, output_format = f"Context: {context}", f"Task: {task}", f"Format: {output_format}"

    # CTF Format used in prompt
    default_prompt = """
    Context : Need to convert flowchart to Steps
    Task : - Generate a JSON structure representing a list of steps. Each step Description must have at least 10 words.
    - Additional Steps can be created to highlight relation between each steps.
    - Steps must reference other steps if there are dependencies.
    - Each step should be a dictionary with the following keys:
    "Step number": an integer representing the step number.
    "Step Description": a string describing the step with at least 10 words.
    Format :
    [
    {
    "Step number":1,
    "Step Description": "The passenger sends a request to the reservation system."
    },
    {
    "Step number":2,
    "Step Description": "The reservation system checks for availability of seats."
    }
    ]
    """

    # Use default prompt if any of the input values are empty or whitespace
    #if any(is_empty_or_whitespace(param.split(':')[1]) for param in [context, task, output_format]):
    if default_prompt_flag is True:
        prompt = default_prompt
    else:
        prompt = "\n".join([context, task, output_format])

    # Query the model with the image and the prompt
    response = multimodal_model.generate_content(
        [
            Part.from_uri(gcs_path, mime_type="image/jpeg"),
            # CTF Format used in prompt
            prompt,
        ],
        # Output configuration
        generation_config={
            "max_output_tokens": 2048,
            "temperature": 0.4,
            "top_p": 1,
            "top_k": 32
        },
    )

    # Return the text part of the response
    return response.text

def process_image_file(client, bucket_name, file_name, local_file_path, writer):
    # Extracting the sheet name from the file name, truncating to 30 characters
    sheet_name = file_name.split('.')[0][:30]
    # Extracting the file name without the path (assumes the object sits at the bucket root)
    file_name = file_name.split('/')[-1]
    logger.info(f"Processing file: {file_name} in bucket {bucket_name}")

    # Generating Google Cloud Storage and local file paths
    gcs_file_path = f"gs://{bucket_name}/{file_name}"
    local_image_path = f"{local_file_path}\\{file_name}"

    try:
        # Calling the function to generate text response based on the image content
        # (context, task, output_format and default_prompt_flag are set in the main block)
        text = generate_text_response(client, gcs_file_path, context, task, output_format, default_prompt_flag)

        # Parsing the generated text as JSON
        final_dictionary = json.loads(text)
        # Creating a pandas DataFrame from the parsed steps
        df = pd.DataFrame(final_dictionary)

        # Writing the DataFrame to an Excel sheet
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        workbook = writer.book
        worksheet = writer.sheets[sheet_name]

        # Formatting Excel sheet with header and cell styles
        num_rows, num_cols = df.shape
        header_format = workbook.add_format({
            'bold': True,
            'text_wrap': True,
            'valign': 'vcenter',
            'align': 'center',
            'fg_color': '#00FFFF',
            'border': 1,
            'font_size': 12,
            'font_color': 'black'
        })

        # Writing column headers to the Excel sheet
        for col_num, value in enumerate(df.columns.values):
            worksheet.write(0, col_num, value, header_format)

        # Applying conditional formatting to the data cells
        cell_format = workbook.add_format({'border': 1})
        worksheet.conditional_format(1, 0, num_rows, num_cols - 1, {'type': 'no_blanks', 'format': cell_format})

        # Downloading the image file and inserting it into the Excel sheet
        bucket = client.bucket(bucket_name)
        blob = bucket.blob(file_name)
        blob.download_to_filename(local_image_path)
        worksheet.insert_image(num_rows + 2, 0, local_image_path, {'x_offset': 15, 'y_offset': 10, 'x_scale': 0.2, 'y_scale': 0.2})
        logger.info(f"Processed file: {file_name}")

    except ValueError as e:
        # Handling JSON parsing error (json.JSONDecodeError is a ValueError)
        logger.error(f"Error parsing JSON: {e}")
        final_dictionary = None
    except Exception as e:
        # Handling any other exception and logging an error message
        logger.error(f"An error occurred: {e}")
        final_dictionary = None

    # Checking the result of the parsing
    if final_dictionary is not None:
        logger.info("Successfully evaluated the dictionary")
    else:
        logger.error(f"Failed to evaluate the dictionary for image file : {file_name}")


def image_to_task_generator(project_name_param, region, bucket_name_param, local_file_fixed_path, output_file_path, output_file_name_prefix, output_file_name_suffix, context, task, output_format):

    # Initializing Google Cloud Storage client and generating output file name
    client = storage.Client(project=project_name_param)
    extract_datetime = datetime.now().strftime("%Y%m%d_%H%M%S")
    output_file_name = f'{output_file_path}\\{output_file_name_prefix}_{extract_datetime}_{output_file_name_suffix}.xlsx'

    # Initializing Excel writer
    writer = pd.ExcelWriter(output_file_name, engine='xlsxwriter')

    blob_list = list(client.list_blobs(bucket_name_param))

    if not blob_list:
        logger.warning(f"No files found in the bucket: {bucket_name_param}")

    # Processing each file in the specified bucket
    for blob in blob_list:
        file_name = str(blob.name)
        _, file_extension = os.path.splitext(file_name)

        if file_extension.lower() in ['.jpg', '.jpeg', '.png']:
            process_image_file(client, bucket_name_param, file_name, local_file_fixed_path, writer)
        else:
            logger.warning(f"Invalid file format for {file_name}. Only jpg and png files are supported.")

    # Closing the Excel writer and logging the generated output file name
    writer.close()
    logger.info(f"Generated Output File: {output_file_name}")

def load_config():
    config = configparser.ConfigParser()
    config.read('dfd_config.ini')
    return config['DEFAULT']

if __name__ == "__main__":
    config = load_config()

    project_name_param = config.get('project_name', 'calm-hologram-409107')
    region = config.get('region', 'us-central1')
    bucket_name_param = config.get('bucket_name', 'vertex-ai-dfd-test-bucket')
    local_file_fixed_path = config.get('local_file_path', 'C:\\Users\\Username\\[2] Gen AI\\Experimentation and POCs\\Final Code\\Temp_path')
    output_file_path = config.get('output_file_path', 'C:\\Users\\Username\\[2] Gen AI\\Experimentation and POCs\\Final Code\\Output_path')
    log_folder_path = config.get('log_folder_path', 'C:\\Users\\Username\\[2] Gen AI\\Experimentation and POCs\\Final Code\\Log_path')
    output_file_name_prefix = config.get('output_file_name_prefix', 'Consolidated Tasks')

    # Configure the logging settings
    logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')
    logger = logging.getLogger(__name__)

    # Create the log folder if it doesn't exist
    os.makedirs(log_folder_path, exist_ok=True)

    # Create a file handler with a dynamic log file name and folder path
    log_file_name = 'image_to_task_gen_app.log'
    log_file_path = os.path.join(log_folder_path, log_file_name)
    file_handler = logging.FileHandler(log_file_path, mode='a')

    # Create a formatter and add it to the file handler
    formatter = logging.Formatter('%(asctime)s - %(levelname)s - %(message)s')
    file_handler.setFormatter(formatter)

    # Add the file handler to the logger
    logger.addHandler(file_handler)

    # CTF Format used in prompt designing. Copy from here during custom input
    # Context : Need to convert Flowchart to Steps
    #
    # Task : Generate a JSON structure representing a list of steps.
    # Each step should be a dictionary with the following keys:
    # "Step number": an integer representing the step number.
    # "Step Description": a string describing the step.
    #
    # Format :
    # "Step number":1,
    # "Step Description": "The passenger sends a request to the reservation system."
    # "Step number":2,
    # "Step Description": "The reservation system checks for availability of seats."

    try:
        # Ask the user for the three parts of the prompt
        context = input("Enter the context: ")
        task = input("Enter the task description: ")
        output_format = input("Enter the output format (in JSON structure): ")

        # Fall back to the default prompt when any part is missing
        if any(is_empty_or_whitespace(param) for param in [context, task, output_format]):
            logger.info("Proceeding with default_prompt")
            default_prompt_flag = True
            output_file_name_suffix = 'DefaultPrompt'
        else:
            logger.info("Proceeding with custom prompt")
            default_prompt_flag = False
            output_file_name_suffix = 'CustomPrompt'

        # Invoking the image_to_task_generator function with the specified parameters
        image_to_task_generator(project_name_param, region, bucket_name_param, local_file_fixed_path, output_file_path, output_file_name_prefix, output_file_name_suffix, context, task, output_format)

    except Exception as e:
        # Log the exception together with its traceback
        logger.error(f"An error occurred: {str(e)}", exc_info=True)
; dfd_config.ini file
[DEFAULT]
project_name = calm-hologram-409107
region = us-central1
bucket_name = vertex-ai-dfd-test-bucket
local_file_path = C:\Users\Username\[2] Gen AI\Experimentation and POCs\Final Code\Temp_path
output_file_name_prefix = DFD Tasks
output_file_path = C:\Users\Username\[2] Gen AI\Experimentation and POCs\Final Code\Output_path
log_folder_path = C:\Users\Username\[2] Gen AI\Experimentation and POCs\Final Code\Log_path

Understanding the Code

Let’s delve into the code that drives the Image to Task Translator:

Initialization and Configuration

The code begins by importing necessary libraries and initializing configurations. It sets up essential parameters such as project details, region settings, bucket names, file paths, and logging configurations.
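As a minimal sketch of the configuration step, the snippet below reads settings from the DEFAULT section of an INI file with configparser and falls back to hard-coded values when a key, or the whole file, is missing. The function name read_settings and the fallback values my-project and my-bucket are illustrative assumptions, not part of the original script.

```python
import configparser
import os
import tempfile


def read_settings(ini_path):
    """Return project, region and bucket settings, with fallbacks for missing keys."""
    config = configparser.ConfigParser()
    # config.read() silently skips files that do not exist,
    # so the fallbacks below also cover a missing INI file.
    config.read(ini_path)
    defaults = config["DEFAULT"]  # the DEFAULT section always exists
    return {
        "project_name": defaults.get("project_name", "my-project"),  # hypothetical fallback
        "region": defaults.get("region", "us-central1"),
        "bucket_name": defaults.get("bucket_name", "my-bucket"),  # hypothetical fallback
    }


# Usage: write a small INI file, read it back, then read a file that does not exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    ini_path = os.path.join(tmp_dir, "dfd_config.ini")
    with open(ini_path, "w") as handle:
        handle.write("[DEFAULT]\nproject_name = demo-project\nregion = europe-west1\n")
    settings = read_settings(ini_path)
    missing_settings = read_settings(os.path.join(tmp_dir, "missing.ini"))
```

Note that the INI file needs a [DEFAULT] section header; without one, configparser raises an error instead of falling back.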

Prompt Designing and Model Integration

The Image to Task Translator calls the generative model gemini-pro-vision on Vertex AI. The prompt follows a Context, Task, Format (CTF) structure: the user is asked for all three parts, and the script joins them with newlines. If any part is empty or contains only whitespace, the script falls back to a default prompt that asks for a JSON list of steps, each with a step number and a description of at least 10 words.
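The prompt selection can be sketched in isolation, without calling the model. The helper names is_blank and build_prompt, and the shortened DEFAULT_PROMPT text, are assumptions made for this example; the Context/Task/Format labels mirror the commented-out formatting line in the script rather than what the script currently sends.

```python
def is_blank(value):
    """True when value is None, empty, or whitespace only."""
    return value is None or not value.strip()


# Shortened stand-in for the default CTF prompt used in the script.
DEFAULT_PROMPT = (
    "Context : Need to convert flowchart to Steps\n"
    "Task : Generate a JSON structure representing a list of steps.\n"
    'Format : a list of objects with the keys "Step number" and "Step Description".'
)


def build_prompt(context, task, output_format):
    """Return (prompt, used_default) for the three CTF parts."""
    parts = [context, task, output_format]
    if any(is_blank(part) for part in parts):
        # Any missing part means the whole custom prompt is discarded.
        return DEFAULT_PROMPT, True
    labelled = [
        f"Context: {context.strip()}",
        f"Task: {task.strip()}",
        f"Format: {output_format.strip()}",
    ]
    return "\n".join(labelled), False


# Usage: one complete custom prompt, and one with missing parts.
prompt, used_default = build_prompt(
    "Flowchart of a booking system", "List the steps", "JSON list"
)
fallback_prompt, fallback_used = build_prompt("", "   ", None)
```

Returning the flag together with the prompt keeps the decision in one place, so the caller can reuse it for the output file name suffix.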

Processing Image Files

The translator processes each .jpg, .jpeg, or .png file in the specified bucket and logs a warning for any other file type. For each image, it sends the image’s Cloud Storage URI and the prompt to the model, parses the returned JSON into a pandas DataFrame, and writes it to its own Excel sheet, named after the file and truncated to 30 characters. The image itself is downloaded and inserted below the table for reference.
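The parsing step is the most fragile part of this flow, because the model returns free text. The sketch below shows one way to make it more tolerant: it strips a Markdown code fence if the response is wrapped in one (an assumption about how the model may format its answer, not something the original script handles) and accepts a single object as well as a list. The function name parse_steps is hypothetical.

```python
import json

FENCE = "`" * 3  # a Markdown code fence marker


def parse_steps(model_text):
    """Parse model output into a list of step dictionaries."""
    text = model_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line (for example, the one ending in "json").
        first_newline = text.find("\n")
        text = text[first_newline + 1:] if first_newline != -1 else ""
        # Drop the closing fence if present.
        if text.rstrip().endswith(FENCE):
            text = text.rstrip()[:-3]
    steps = json.loads(text)  # raises json.JSONDecodeError (a ValueError) on bad input
    if isinstance(steps, dict):
        steps = [steps]  # tolerate a single step object
    return steps


# Usage: a fenced response and a bare single-object response.
fenced_response = (
    FENCE + "json\n"
    '[{"Step number": 1, "Step Description": "Passenger sends a request."}]\n'
    + FENCE
)
fenced_steps = parse_steps(fenced_response)
single_steps = parse_steps('{"Step number": 2, "Step Description": "Seats are checked."}')
```

The resulting list of dictionaries can be passed to pd.DataFrame in the same way as the output of json.loads in the script.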

Error Handling and Logging

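Each image is processed inside a try/except block, so a malformed model response or a failed download is logged and the run continues with the next file. JSON parsing errors are caught as ValueError and reported separately from other exceptions. Log messages go to the console and to a log file, image_to_task_gen_app.log, in the configured log folder.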
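The logging pattern can be tried on its own, as in the following sketch. It attaches a console handler and a file handler to one logger and records a JSON parsing failure. The function name build_logger and the logger name image_to_task are assumptions for this example; the log file name and message format are the ones used in the script.

```python
import json
import logging
import os
import tempfile


def build_logger(log_path, name="image_to_task"):
    """Return a logger that writes INFO and above to the console and to log_path."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if logger.handlers:
        # Already configured (for example, a notebook cell was re-run);
        # adding handlers again would duplicate every message.
        return logger
    formatter = logging.Formatter("%(asctime)s - %(levelname)s - %(message)s")
    for handler in (logging.StreamHandler(), logging.FileHandler(log_path, mode="a")):
        handler.setFormatter(formatter)
        logger.addHandler(handler)
    return logger


# Usage: log a JSON parsing error to a temporary folder.
log_folder = tempfile.mkdtemp()
log_path = os.path.join(log_folder, "image_to_task_gen_app.log")
logger = build_logger(log_path)

try:
    json.loads("not valid json")
except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
    logger.error(f"Error parsing JSON: {exc}")

# Make sure the message is on disk before the file is read.
for handler in logger.handlers:
    handler.flush()
```

The guard on logger.handlers matters in Jupyter, where re-running a cell would otherwise attach a second pair of handlers to the same logger.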

Final Output

Figure: Generated output in the first sheet
Figure: Generated output in the second sheet
Figure: Different files generated by the same code for different image sources in the GCS bucket
Figure: Logs generated during execution in a Jupyter Notebook

Possible Use Cases

The Image to Task Translator can be applied across various domains and industries:

  1. Business Process Management: Streamlining complex workflows and procedures.
  2. Software Development: Converting software architecture diagrams into actionable tasks.
  3. Education: Transforming educational diagrams into step-by-step learning modules.
  4. Project Management: Converting project flowcharts into actionable task lists.

Limitations and Future Enhancements

While the Image to Task Translator offers significant advantages, it’s essential to acknowledge its limitations:

- Accuracy: The accuracy of task generation heavily depends on the complexity and clarity of the input flowcharts.
- Model Constraints: The performance of the Gemini Vision Pro model may vary based on the diversity and scale of the input data.
- Resource Intensiveness: Processing large volumes of images may require substantial computational resources.
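Because accuracy depends on the input, it can help to check the parsed steps against the rules stated in the default prompt before writing them to Excel. The sketch below is one possible check, not part of the original script: the function name find_step_problems is hypothetical, and the 10-word minimum and sequential numbering are taken from the prompt’s wording.

```python
def find_step_problems(steps, min_words=10):
    """Return a list of human-readable problems found in the parsed steps."""
    problems = []
    for index, step in enumerate(steps, start=1):
        if not isinstance(step, dict):
            problems.append(f"item {index}: not a JSON object")
            continue
        number = step.get("Step number")
        description = step.get("Step Description", "")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(number, int) or isinstance(number, bool):
            problems.append(f"item {index}: 'Step number' is not an integer")
        elif number != index:
            problems.append(f"item {index}: expected step number {index}, got {number}")
        if not isinstance(description, str) or len(description.split()) < min_words:
            problems.append(f"item {index}: description has fewer than {min_words} words")
    return problems


# Usage: one well-formed step list and one that breaks both rules.
good_steps = [
    {
        "Step number": 1,
        "Step Description": "The passenger sends a seat reservation request "
                            "to the central reservation system.",
    }
]
bad_steps = [{"Step number": "1", "Step Description": "Too short."}]
good_problems = find_step_problems(good_steps)
bad_problems = find_step_problems(bad_steps)
```

An empty list means the output matches the requested structure; a non-empty list could be logged as a warning or used to decide whether to retry the image.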

Future enhancements may focus on improving model accuracy, optimizing resource utilization, and expanding compatibility with diverse file formats.


The Image to Task Translator shows how a multimodal model such as Gemini Pro Vision can automate part of process documentation. By turning flowchart images into structured task lists, it reduces manual transcription effort and makes processes easier to review and act on. As multimodal models improve, the same approach may extend to more complex diagrams.

Abhik Saha
Data Engineer @Accenture India || Writes about Bigquery, Cloud Function, GCP, SQL