Create multimodal conversational experiences with Google Cloud Dialogflow CX and Gemini Vision

Gabriele Randelli
11 min read · Apr 5, 2024


Boosting your digital assistant by analyzing images with Gemini Vision


Introduction

Digital assistants are increasingly prevalent in our lives. Almost every company has deployed voice or chat bots to improve customer service and reduce operational costs.

While speech-to-text technologies keep improving and lead to ever more natural conversations with bots, people often prefer non-verbal communication. In specific scenarios, a photo is worth hundreds of words.

Dialogflow CX is Google Cloud’s technology to implement omnichannel virtual agents. Alongside the possibility to create intent-based deterministic flows, Dialogflow has recently evolved thanks to the generative AI foundation models released by Google, namely PaLM 2 and Gemini.

While Dialogflow CX can natively integrate with Google’s text-to-text Large Language Models through generators, the main scope of this article is to showcase how Gemini 1.0 Pro Vision can extend your agent with powerful multimodal capabilities. Customers submit an image and delegate the extraction of the required information to the bot, rather than manually inputting data one by one over manifold conversational rounds. Applications are manifold: banking, to extract PII data; insurance, to validate car damages; and so on.

The rest of this article provides a primer to implement such an extension, sketching out the different components and providing an example applied to a car rental bot.

Customer User Journey

Before digging into technical aspects, it’s worth clarifying the expected customer user journey:

  • At a specific conversational round, the chat bot asks the customer to upload the image to be analyzed;
  • The customer selects a file from her own device, which is in turn transmitted from the chat widget to the bot on the server side;
  • The bot invokes a back-end component, whose role is to interact with the large language model;
  • The back-end component submits the image to the multimodal language model, alongside a prompt to extract the desired information;
  • The output of the LLM is passed back to the back-end component, which validates the driving license and returns a response code to the bot;
  • The bot fulfills the customer’s request by transmitting a verbose response to the chat widget;
  • The chat widget displays the extracted details and asks the customer for confirmation.
Flowchart of the customer user journey with Dialogflow CX and Gemini
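The steps above can be sketched as a stubbed pipeline. Every function below is a hypothetical stand-in for the real component (the Messenger widget, the back-end, Gemini), not an actual API call:

```python
from datetime import datetime

def chat_widget_upload(image: bytes) -> str:
    """Stand-in for Dialogflow Messenger uploading the file to GCS."""
    return "gs://my-bucket/sess-123/ab12_license.jpg"  # illustrative path

def llm_extract(gcs_uri: str, prompt: str) -> str:
    """Stand-in for the multimodal LLM; returns the extracted text."""
    return "12/31/2030"  # canned answer for this sketch

def backend_validate(gcs_uri: str) -> str:
    """Stand-in for the back-end component validating the license."""
    raw = llm_extract(gcs_uri, "extract the expiration date")
    expiration = datetime.strptime(raw, "%m/%d/%Y")
    return "OK" if expiration >= datetime.now() else "KO_EXP"

def bot_round(image: bytes) -> str:
    """Stand-in for the bot orchestrating one conversational round."""
    gcs_uri = chat_widget_upload(image)
    return backend_validate(gcs_uri)
```

The rest of the article replaces each stub with the real component.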

Architecture

The architecture to implement our customer user journey consists of five main technologies: Dialogflow Messenger, Dialogflow CX, a Google Cloud Storage (GCS) staging bucket, Cloud Functions, and Gemini 1.0 Pro Vision on Vertex AI.

An architecture diagram is reported below:

High-level architecture diagram with the main GCP components

Let’s start by creating the staging GCS bucket. Assign the bucket whatever name you prefer and pick your preferred region (just ensure that you’ll be using the same region for the rest of this article). You’re going to re-use the bucket later on.

Dialogflow Messenger

Dialogflow Messenger is a lightweight web client for Dialogflow agents. It enables a seamless integration with DFCX bots and it also supports the most recent Generative AI features available.

Let’s start importing the Dialogflow Messenger JS library:

<script src="https://www.gstatic.com/dialogflow-console/fast/df-messenger/prod/v1/df-messenger.js">
</script>

Then, instantiate the messenger widget in the body section of your HTML page and enable the file upload feature:

<df-messenger
  project-id="[your-project-id]"
  agent-id="[your-agent-id]"
  language-code="en"
  allow-feedback="all">
  <df-messenger-chat-bubble
    chat-title="Car Rental"
    enable-file-upload>
  </df-messenger-chat-bubble>
</df-messenger>
<style>
  df-messenger {
    --df-messenger-input-inner-padding: 0 48px;
  }
</style>

Heads-up: the padding in the style tag is required to make room for the upload button and avoid rendering issues.

Next, we need to specify the desired GCS upload bucket with the following JavaScript snippet:

globalThis.dfInstallUtils({
  'gcs-bucket-upload': {bucketName: '[your-bucket-name]'},
});

Set the bucketName attribute to the name of the bucket you created earlier.

Let me quickly go through the execution flow:

  • Once the user selects a file, a df-upload-file-selected event is fired;
  • Without a custom listener intercepting the event, the default behavior is to upload the file to the GCS bucket via the REST API;
  • Upon success, an info message pops up in the widget and a new df-file-upload-completed event is fired;
  • (Optional) Feel free to intercept the latter with an event listener if you want to run custom post-upload processing.

The uploaded file path in GCS is stored as a query parameter for the DF agent, with the following format:

parameters: {
  files: [
    'bucket-id/session-id/random-id_file-name',
  ]
}

Heads-up: the query parameters are not automatically sent to the agent. They will be included in the next request.

That’s why the last step needed to send out the query parameter is to push a hidden intent. Let’s implement this step with an event handler:

dfMessenger.addEventListener('df-file-upload-completed', function (event) {
  dfMessenger.sendRequest('query', 'file uploaded');
});

With this request we’re pushing a “file uploaded” utterance, without any user action, which will later on be detected by the agent (see the agent extension section below).
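For reference, a files entry in that format can later be split back into its components on the server side; a minimal Python sketch (the helper name is illustrative, not part of any SDK):

```python
def parse_uploaded_file_param(files_entry: str) -> dict:
    """Split 'bucket-id/session-id/random-id_file-name' into its parts."""
    bucket_id, session_id, object_name = files_entry.split("/", 2)
    random_id, _, file_name = object_name.partition("_")
    return {
        "bucket": bucket_id,
        "session": session_id,
        "random_id": random_id,
        "file_name": file_name,
        "gcs_uri": "gs://" + files_entry,  # the form the back-end will consume
    }
```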

Gemini 1.0 Pro Vision

Gemini is Google’s family of Large Language Models, released in December 2023, and has been conceived as multimodal from scratch. In a nutshell, a multimodal LLM takes text, images or videos as input and outputs text.

Within the scope of this article, we are going to use a multimodal LLM to pass a driving license photo and ask the model to automatically extract PII data, in order to check the validity of the license and the minimum age required to rent a car (in our example, 21).

You are going to interact with the Gemini Pro Vision API via a Cloud Function, Google’s Function-as-a-Service (FaaS) technology. However, our ultimate goal is to call this function from Dialogflow via a webhook. Therefore, we need to follow the expected Dialogflow request and response format.
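To make that format concrete, here is a trimmed sketch of the webhook request body Dialogflow sends and the response body we will return (values are illustrative):

```python
# Trimmed Dialogflow CX webhook request body (illustrative values)
request_body = {
    "sessionInfo": {
        "parameters": {
            "files": ["my-bucket/sess-123/ab12_license.jpg"],
        }
    }
}

# The GCS path our function will read, prefixed with the gs:// scheme
file_path = "gs://" + request_body["sessionInfo"]["parameters"]["files"][0]

# The response body Dialogflow expects: session parameters to merge back
response_body = {"sessionInfo": {"parameters": {"response": "OK"}}}
```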

Let’s start by creating the cloud function with the following settings:

  • Name: driving-license-webhook
  • Require HTTPS
  • Timeout: 60 seconds (do not decrease this value or you may experience timeouts due to Gemini)
  • AppEngine default service account
  • Allow all traffic

Leave the rest of the settings at their default values. Select Python 3.12 as the runtime environment.

Fill in the requirements.txt file with the following dependencies:

# Function dependencies, for example:
# package>=version
google-cloud-aiplatform>=1.38

Fill in the main.py file with the following code (heads up: you need to edit some parts within the code block):

from datetime import datetime
from dateutil.relativedelta import relativedelta
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

def submit_multimodal_prompt(project_id: str, location: str, image_path: str, prompt: str) -> str:
    # Initialize Vertex AI
    vertexai.init(project=project_id, location=location)
    # Load the model
    multimodal_model = GenerativeModel("gemini-pro-vision")
    # Query the model
    response = multimodal_model.generate_content(
        [
            # Add the uploaded image
            Part.from_uri(image_path, mime_type="image/jpeg"),
            # Add the extraction prompt
            prompt,
        ]
    )

    print("Gemini Output: " + response.text)
    return response.text

def check_expiration_date(project_id: str, region: str, file_path: str):
    # Run the multimodal prompt
    prompt = "extract the expiration date from the driving license"
    exp_date_str = submit_multimodal_prompt(project_id, region, file_path, prompt)

    # Check whether the driving license is still valid based on the expiration date
    exp_date = datetime.strptime(exp_date_str.replace(" ", ""), '%m/%d/%Y')
    now_date = datetime.now()

    return exp_date >= now_date

def check_driver_age(project_id: str, region: str, file_path: str):
    # Run the multimodal prompt
    prompt = "extract the date of birth from the driving license"
    dob_str = submit_multimodal_prompt(project_id, region, file_path, prompt)

    # Check whether the driver is at least 21 years old, as requested by some car rental companies
    dob_date = datetime.strptime(dob_str.replace(" ", ""), '%m/%d/%Y')
    now_date = datetime.now()

    difference_in_years = relativedelta(now_date, dob_date).years

    return difference_in_years >= 21

def validate_driving_license(request):
    """Responds to any HTTP request.
    Args:
        request (flask.Request): HTTP request object.
    Returns:
        The response text or any set of values that can be turned into a
        Response object using
        `make_response <http://flask.pocoo.org/docs/1.0/api/#flask.Flask.make_response>`.
    """
    req = request.get_json()
    file_path = 'gs://' + str(req["sessionInfo"]["parameters"]["files"][0])

    # CHANGE THESE PARAMS
    project_id = 'your-project-id'
    region = 'your-region'

    # Check the expiration date
    is_exp_date_valid = check_expiration_date(project_id, region, file_path)

    # Check the minimum driver age
    is_driver_age_valid = check_driver_age(project_id, region, file_path)

    if is_exp_date_valid and is_driver_age_valid:
        is_valid = "OK"
    elif not is_exp_date_valid:
        is_valid = "KO_EXP"
    else:
        is_valid = "KO_AGE"

    res = {"sessionInfo": {"parameters": {"response": is_valid}}}
    return res

Let’s analyze the code step by step:

  • The cloud function’s entry point is validate_driving_license(request). Dialogflow passes the request input parameter via the webhook in a predefined format, from which we extract the relevant session parameters, such as the path to the image;
  • There are two validation steps: the driver’s age and the driving license’s validity;
  • Both steps leverage the submit_multimodal_prompt() method, where the Gemini Vision API is invoked and the real magic happens;
  • According to the check results, the corresponding status code is stored in the response session parameter and returned to the bot via the webhook.
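The status-code logic lends itself to being factored into a small pure function, which is handy for unit testing outside the cloud function. A sketch mirroring the branching above (function names are illustrative):

```python
def license_status(exp_date_valid: bool, driver_age_valid: bool) -> str:
    """Map the two validation results onto the webhook response codes."""
    if exp_date_valid and driver_age_valid:
        return "OK"
    if not exp_date_valid:
        return "KO_EXP"  # an expired license wins over an age problem
    return "KO_AGE"

def build_webhook_response(status: str) -> dict:
    """Wrap the status code in the Dialogflow CX webhook response format."""
    return {"sessionInfo": {"parameters": {"response": status}}}
```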

Dialogflow CX Car Rental Prebuilt Agent

Let’s now turn to defining the agent and integrating the various components.

For simplicity, let’s import one of the prebuilt agents available in Dialogflow CX, a car rental bot. Later on, you are going to extend the agent itself with a small set of pages and routes.

You can deploy the initial prebuilt agent by following this guide. Select Travel: car rental agent from the list of available agents.

The prebuilt agents menu in Dialogflow CX
The prebuilt agents menu in Dialogflow CX

Name the agent, select the global location and CX flow as the default resource. The imported agent is reported below.

The imported prebuilt Dialogflow CX car rental agent

Agent Extension

You are going to extend the imported agent with a couple of new pages to hook in the back-end logic developed so far. The final expected result is highlighted below with a red bounding box.

The extended version of the previously imported agent

The first page, Acquire Driving License, is in charge of collecting the customer’s driving license and triggering its validation. This page complements what we’ve described in the Dialogflow Messenger section.

This page contains an intent route based on the file.upload intent, which maps the utterance we’ve just pushed from the chat widget (“file uploaded”).

When Dialogflow detects this intent, both the image file and the image path are available and stored respectively in the GCS bucket and in the DF query parameter.

Intent route in the Acquire Driving License page

As a result, the driving-license-webhook (described in the next section) is invoked. Upon receiving the webhook’s response, there is an automatic transition to the next page: Validate Driving License.

Transition from first page to second page

The second page, Validate Driving License, is in charge of parsing the webhook’s response and triggering the corresponding behavior. This is accomplished by setting two conditional routes.

Conditional routes in the Validate Driving License page

Both conditions rely on the webhook’s response, which can be retrieved through the $session.params.response session parameter.

If the webhook returns “OK”, the page will just report a fulfillment and move to the next page.

Fulfillment when the driving license is valid

If not, since the webhook also returns a specific error code, we apply a conditional response to describe the precise problem. Given how specific the problem is, we hand over its resolution to a live agent.

if $session.params.response = "KO_AGE"
  Unfortunately our company requires the minimum age for car rental to be 21. Let me transfer you to an agent.
elif $session.params.response = "KO_EXP"
  Unfortunately your driving license is expired. Let me transfer you to an agent.
else
  Unfortunately your driving license is not valid. Let me transfer you to an agent.
endif
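If you ever move this branching server-side, the same mapping can be expressed as a lookup table. A Python sketch of the messages above (this is not DFCX syntax, just an equivalent):

```python
ERROR_MESSAGES = {
    "KO_AGE": "Unfortunately our company requires the minimum age for car rental to be 21. Let me transfer you to an agent.",
    "KO_EXP": "Unfortunately your driving license is expired. Let me transfer you to an agent.",
}
DEFAULT_MESSAGE = "Unfortunately your driving license is not valid. Let me transfer you to an agent."

def error_message(response_code: str) -> str:
    """Pick the fulfillment message matching the webhook's response code."""
    return ERROR_MESSAGES.get(response_code, DEFAULT_MESSAGE)
```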

Webhook

A webhook is a service that hosts your business logic or calls other services. It’s the suggested approach to leverage back-end services. In our case, the webhook triggers the cloud function previously defined, which in turn invokes the Gemini Vision API. We could even call the Gemini API directly from the webhook, but relying on a dedicated cloud function fosters better decoupling between the bot and the back-end and lets us implement the driving license validation logic.

Let’s create the driving-license-webhook that will invoke the corresponding cloud function: go to Manage > Webhooks and click Create.

Fill in the settings, using the cloud function’s trigger URL as the webhook URL, and click Save.

Lastly, let’s bind the webhook to the conversational page. Click Build to go back to the bot design UI. Open the Acquire Driving License page, select the file.upload route and, from the right panel, select Webhook settings, then Enable webhook. Select driving-license-webhook from the combo box.

The webhook triggered from the Acquire Driving License page

This is the last step of our integration and the whole customer journey is ready to be tested.

Summary

Generative AI is influencing manifold technologies, digital assistants included. Dialogflow CX already integrates LLMs through built-in features (e.g. generative fallbacks, generators, playbooks).

The scope of this article is to further extend this integration by adopting multimodal LLMs, such as Gemini 1.0 Pro Vision, to create an even more natural interaction between customers and bots. With a multimodal customer experience, customers can upload images and have the relevant info extracted automatically, sparing them tedious conversational rounds of manual data entry.

Below you can find additional resources, as well as a basic implementation of this article published on my GitHub repository. I strongly encourage you to deploy the code and further customize it according to your use case.

Stay tuned for more material around conversational AI!

What’s Next?

Documentation

Overview of Generative AI on Vertex AI

Overview of Multimodal Models

Dialogflow CX Basics

Dialogflow CX Webhooks

Github Sources

https://github.com/grandelli/dfcx-geminiprovision (the main GitHub repository for this project — contains the modified DFCX agent, the cloud function and the Dialogflow Messenger client)

Acknowledgements

Special thanks to Vojin Katic for reviewing this article, Giorgio Conte for our joint collaboration on a similar project and Matteo Consoli for unconsciously influencing me to get back to writing.

Feel free to leave your comments here or connect with me on LinkedIn.


Gabriele Randelli

Customer Engineer @GoogleCloud. Fond of Machine Learning. Co-founder of the Italian Association for Machine Learning.