How to prompt Gemini asynchronously using Python on Google Cloud

Paul Balm
Google Cloud - Community
5 min read · Jun 27, 2024

Gemini is a family of multimodal large language models developed by Google DeepMind. They are currently (June 2024) Google’s most capable models of this type. Gemini is also the name of a consumer-facing web application that uses the Gemini models, similar to that other one that people talk about, but here I talk about the models.

In this post I’d like to show a solution to a common problem that occurs when you have a long list of questions or prompts to send to your model. For example, you may want to evaluate a prompt template on a set of questions and compare the responses to a set of reference answers.

The problem here is that running your test set of questions through the model can take a long time if you send them one by one. It would be much more efficient to send all questions at once and then wait for all the answers to come back. This is called “prompting the model asynchronously”.

Here’s how you can prompt Gemini 1.5 Pro (version 001) in a synchronous manner:

import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.preview.generative_models as generative_models

def generate(prompt, my_project):
    vertexai.init(project=my_project, location="us-central1")
    model = GenerativeModel(
        "gemini-1.5-pro-001",
    )
    responses = model.generate_content(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=True,
    )

    return "".join([response.text for response in responses])

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.95,
}

safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

my_project = 'YOUR_PROJECT_HERE'
print(generate("hello", my_project))

The above code requires you to set “my_project” to your Google Cloud project ID; that is the project your calls to Gemini are billed to. It will probably print something like:

Hello! 👋 How can I help you today? 😊

I just tried it and it took about 2 seconds. Now suppose you wanted to learn a few recipes; you could prompt Gemini for this. These are longer responses that take a bit longer to generate: the examples below typically take about 12 seconds each. Now imagine you wanted to learn how to cook your 8 favorite Dutch food items:

foodstuffs = ['zure haring', 'stroopwafels', 'hutspot', 'stamppot rauwe andijvie', 'zoute drop', 'poffertjes', 'bitterballen', 'oliebollen']

[generate(f'give me a recipe for {f}', my_project) for f in foodstuffs]

This will take around 8 times 12 seconds, so about 96 seconds, to execute. That may not be very long, but if you wanted to get a collection of 100 Dutch recipes, or run 100 questions to validate the answers, it would take about 20 minutes. What if we could send these prompts in parallel?
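If you want to verify these timings for yourself, you can wrap the sequential loop in a simple timer. This is just a quick check, using only the generate function and the foodstuffs list defined above:

import time

start = time.perf_counter()
recipes = [generate(f'give me a recipe for {f}', my_project) for f in foodstuffs]
print(f'Generated {len(recipes)} recipes in {time.perf_counter() - start:.1f} seconds')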

We will use the packages asyncio and tenacity to define an asynchronous version of the “generate” function. Here’s the code:

import asyncio
from tenacity import retry, wait_random_exponential

import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.generative_models as generative_models

@retry(wait=wait_random_exponential(multiplier=1, max=120))
async def async_generate(prompt, my_project):
    vertexai.init(project=my_project, location="us-central1")
    model = GenerativeModel(
        "gemini-1.5-pro-001",
    )
    response = await model.generate_content_async(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )

    return response.text


generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.95,
}

safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}

Compared to the earlier code:

  • We define the async_generate function with the async keyword so that it can run asynchronously
  • We call “model.generate_content_async” from inside the async_generate function to get actual asynchronous behaviour
  • In order to get the response from “model.generate_content_async”, we have to add the “await” keyword when we call it
  • We added a “retry” decorator from the tenacity package. If a call to the model fails, this will cause it to be tried again. Failures can easily happen here: your project has a quota, to prevent abuse and cost overruns, that limits the number of calls per minute you can make to the Gemini model, and we want to send multiple (many?) calls without waiting for each answer first. This specific decorator waits a random amount of time before retrying, so that if multiple calls hit the quota at the same time, they are not also retried at the exact same time, which would probably just hit the quota again. The wait is drawn from a window that grows exponentially with each attempt, capped at 2 minutes. A variation that also caps the number of attempts is sketched right after this list.
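For example, if you would rather give up after a few attempts instead of retrying indefinitely, tenacity lets you combine a stop condition with the same backoff. This is only a sketch of a variation, not part of the original code: the name async_generate_capped and the limit of 5 attempts are arbitrary choices.

from tenacity import retry, stop_after_attempt, wait_random_exponential

# Same backoff as before, but give up after 5 attempts (an arbitrary example limit)
# and re-raise the last exception so failures become visible.
@retry(wait=wait_random_exponential(multiplier=1, max=120),
       stop=stop_after_attempt(5),
       reraise=True)
async def async_generate_capped(prompt, my_project):
    # body identical to the async_generate function above
    ...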

We can now use this function to get a recipe for poffertjes. From a Jupyter notebook you can do:

my_project = 'YOUR_PROJECT_HERE'
response = await async_generate("give me a recipe for poffertjes", my_project)
print(response)

But if you are executing this code as a stand-alone Python program, you need to start the asyncio event loop yourself (in a notebook, Jupyter has done it for you):

my_project = 'YOUR_PROJECT_HERE'
response = asyncio.run(async_generate("give me a recipe for poffertjes", my_project))

print(response)

In either case, this again takes about 12 seconds, so you can expect 96 seconds for 8 recipes. But what we would like to do is to request all recipes in parallel, which you can do as follows (in a Jupyter notebook):

my_project = 'YOUR_PROJECT_HERE'
get_responses = [async_generate(f'give me a recipe for {f}', my_project) for f in foodstuffs]
recipes = await asyncio.gather(*get_responses)

This will take about 17 seconds to get all your recipes. And again, outside of the Jupyter environment, you will have to start the event loop yourself:

async def main():
    foodstuffs = ['raw herring', 'stroopwafels', 'hutspot', 'stamppot rauwe andijvie', 'zoute drop', 'poffertjes', 'bitterballen', 'oliebollen']
    get_responses = [async_generate(f'give me a recipe for {f}', my_project) for f in foodstuffs]
    return await asyncio.gather(*get_responses)

responses = asyncio.run(main())
for r in responses:
    print(r)
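If you scale this up from 8 prompts to hundreds, firing them all at once will mostly exercise the retry loop against your per-minute quota. One option, not used in the code above, is to cap the number of concurrent requests with an asyncio.Semaphore. This is a minimal sketch: the names bounded_generate and bounded_main and the limit of 5 concurrent requests are my own choices, to be tuned to your quota.

async def bounded_generate(semaphore, prompt, my_project):
    # Only a limited number of requests are in flight at any moment.
    async with semaphore:
        return await async_generate(prompt, my_project)

async def bounded_main(prompts, my_project, limit=5):
    semaphore = asyncio.Semaphore(limit)  # arbitrary example limit
    tasks = [bounded_generate(semaphore, p, my_project) for p in prompts]
    return await asyncio.gather(*tasks)

# For example:
# recipes = asyncio.run(bounded_main([f'give me a recipe for {f}' for f in foodstuffs], my_project))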

Enjoy!

Troubleshooting

The first thing to note is that this program should respond in about 20 seconds. If nothing has happened after one minute, something is wrong! It’s not going to get better by waiting much longer, I’m afraid.

The reason the program can take a long time is that we have added an automatic retry loop. This is really useful when you send lots of prompts and hit your quota of queries per minute. But if there is a bug and all prompts fail, you are just waiting a long time for all the retries and their variable wait times to complete.

Therefore, if your program doesn’t seem to respond, comment out or remove the line that starts with “@retry(…”. That way, each prompt will only be tried once, and if it fails, you will see an error.
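With the retry disabled, you can also ask asyncio.gather to hand back exceptions instead of aborting on the first failure, so you can see which prompt failed and why. This is a small sketch on top of the notebook example above (it reuses get_responses and foodstuffs), not part of the original program:

results = await asyncio.gather(*get_responses, return_exceptions=True)
for f, r in zip(foodstuffs, results):
    if isinstance(r, Exception):
        print(f'{f}: failed with {r!r}')
    else:
        print(f'{f}: ok, {len(r)} characters')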

Here’s the full version of the asynchronous program, with the “@retry(…” line commented out so that errors surface immediately; re-enable it once the basic call works. You still have to replace “YOUR PROJECT HERE” with your project ID:

import asyncio
from tenacity import retry, wait_random_exponential

import vertexai
from vertexai.generative_models import GenerativeModel, Part, FinishReason
import vertexai.generative_models as generative_models

#@retry(wait=wait_random_exponential(multiplier=1, max=120))
async def async_generate(prompt, my_project):
    vertexai.init(project=my_project, location="us-central1")
    model = GenerativeModel(
        "gemini-1.5-pro-001",
    )
    response = await model.generate_content_async(
        [prompt],
        generation_config=generation_config,
        safety_settings=safety_settings,
        stream=False,
    )

    return response.text

generation_config = {
    "max_output_tokens": 8192,
    "temperature": 1,
    "top_p": 0.95,
}

safety_settings = {
    generative_models.HarmCategory.HARM_CATEGORY_HATE_SPEECH: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_SEXUALLY_EXPLICIT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    generative_models.HarmCategory.HARM_CATEGORY_HARASSMENT: generative_models.HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
}


my_project = 'YOUR PROJECT HERE'

async def main():
    foodstuffs = ['raw herring', 'stroopwafels', 'hutspot', 'stamppot rauwe andijvie', 'zoute drop', 'poffertjes', 'bitterballen', 'oliebollen']
    get_responses = [async_generate(f'give me a recipe for {f}', my_project) for f in foodstuffs]
    return await asyncio.gather(*get_responses)

responses = asyncio.run(main())
for r in responses:
    print(r)

If you save this file as “test.py”, then you:

  • authenticate using gcloud auth application-default login
  • execute the program using python test.py

Let me know in the comments if you’re still stuck and I’ll try to respond. Good luck!
