From GenAI Demo to Production Scale — Handling high-throughput use cases.

Olejniczak Lukasz
Google Cloud - Community
10 min read · Jul 12, 2024

Moving Generative AI (GenAI) demonstrations to production-level systems can be difficult, but recognizing usage patterns early on can simplify this transition. Some applications need fast response times and consistent uptime, while others need to generate large amounts of content efficiently.

This article discusses options available on Google Cloud when high throughput is essential. For example, consider a scenario where you need to create many personalized product descriptions or marketing materials. In this case, low latency may not be the main focus, but ensuring efficient generation at scale is crucial. There is also an economic advantage here: by opting for batch processing instead of real-time interactions, you can achieve a 50% cost reduction on Google Cloud. This makes it a compelling choice for scenarios where generating large amounts of content efficiently is paramount.

The two options for production-ready, high-throughput use cases we will cover in this article are:

  • Vertex AI BATCH API
  • BigQuery ML.GENERATE_TEXT

Vertex AI BATCH API

Batch API is supported for the following models:

  • gemini-1.5-flash-001
  • gemini-1.5-pro-001
  • gemini-1.0-pro-002
  • gemini-1.0-pro-001

Let’s dive into building a Batch Job. We’ll use Vertex AI Colab as our working environment, a cloud-based Jupyter notebook environment that provides easy access to GPUs and pre-installed libraries for machine learning and AI development.

For this blog we will use gemini-1.5-flash:

model_name = "gemini-1.5-flash-001"

Batch API uses a BigQuery table as its source. UPDATE: Batch API now also supports JSONL files from Google Cloud Storage.

It expects this table to already exist, and it expects this table to have a column named request.

Values in this column need to comply with the Gemini request JSON schema. An example request is shown below:

record = {
    "contents": [
        {
            "role": "user",
            "parts": {
                "text": "Give me a recipe for banana bread."
            }
        }
    ],
    "system_instruction": {
        "parts": [
            {
                "text": "You are a chef."
            }
        ]
    }
}

Your input table can have columns other than request. They are ignored for content generation but will be included in the output table.
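
As an aside, if you go with the Cloud Storage option mentioned above, the same request objects are simply written one per line to a JSONL file and uploaded to a bucket; the job's inputConfig then points at the gs:// URI instead of a BigQuery table. A minimal sketch (the bucket and object names below are made up):

import json
from google.cloud import storage

# Hypothetical bucket and object names for the JSONL input.
bucket_name = "my-genai-batch-input"
blob_name = "batch_requests.jsonl"

# One Gemini request object per line (here just the single example record).
jsonl_payload = "\n".join(json.dumps(r) for r in [record])

storage.Client().bucket(bucket_name).blob(blob_name).upload_from_string(jsonl_payload)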

So let’s get our hands dirty. As a prerequisite we need to prepare a BigQuery table. There are many ways to accomplish this, but here we will use the Python BigQuery client SDK.

For our data ingestion step, we need to format the data so that BigQuery knows we want to load the prepared record JSON into the column named request:

import json 
record_for_bigquery = {"request": json.dumps(record)}
record_for_bigquery
{'request': '{"contents": [{"role": "user", "parts": {"text": "Give me a recipe for banana bread."}}], "system_instruction": {"parts": [{"text": "You are a chef."}]}}'}

Of course, we will usually have a number of records. To make this code easy to follow, I will start with just one request record in the array of records to be loaded into the BigQuery table:

records_for_bigquery = [record_for_bigquery]
records_for_bigquery
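
For a realistic run you would build one such record per item. Here is a purely illustrative sketch: the product list is made up, and the walkthrough below continues with the single record defined above. It also adds an extra product_id column, which (as noted earlier) is ignored for generation but copied to the output table:

import json

# Hypothetical items we want descriptions for.
product_names = ["ergonomic office chair", "noise-cancelling headphones", "standing desk"]

many_records_for_bigquery = []
for product_id, name in enumerate(product_names):
    req = {
        "contents": [{
            "role": "user",
            "parts": {"text": f"Write a short, engaging product description for: {name}"}
        }],
        "system_instruction": {"parts": [{"text": "You are a marketing copywriter."}]}
    }
    many_records_for_bigquery.append({
        "request": json.dumps(req),
        # Extra columns are ignored for generation but included in the output table.
        "product_id": product_id
    })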

Loading these records into a BigQuery table using the Python BigQuery client SDK can be handled as follows:

from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client(project='genai-app-builder', location="us-central1")

# Define the table ID (e.g., 'my-project.my_dataset.my_table')
table_id = 'genai-app-builder.genaibatch.batchinput'

job_config = bigquery.LoadJobConfig(
    # Optional: configure write disposition ('WRITE_APPEND', 'WRITE_TRUNCATE', etc.)
    write_disposition="WRITE_TRUNCATE",
    # Let BigQuery infer the schema (a single STRING column named "request").
    autodetect=True,
)

# Load the list of dictionaries; this creates the table if it does not exist yet.
load_job = client.load_table_from_json(records_for_bigquery, table_id, job_config=job_config)
load_job.result()  # Wait for the load job to complete.

Our test set is now available in the BigQuery table.
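
A quick sanity check with the same client confirms the rows landed:

# Count the rows in the freshly loaded input table.
count_query = f"SELECT COUNT(*) AS n FROM `{table_id}`"
print(list(client.query(count_query).result())[0].n)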

The next step is to run a batch prediction job.

Our batch job requires a few details:

  • job name
  • model name
  • bigquery input table
  • bigquery output table

batch_job_name = "gemini_batch_api"

request_body = {
    "displayName": f"{batch_job_name}",
    "model": f"publishers/google/models/{model_name}",
    "inputConfig": {
        "instancesFormat": "bigquery",
        "bigquerySource": {
            "inputUri": f"bq://{table_id}"
        }
    },
    "outputConfig": {
        "predictionsFormat": "bigquery",
        "bigqueryDestination": {
            "outputUri": "bq://genai-app-builder.genaibatch.batchoutput"
        }
    }
}

request_body

We are ready to call BATCH API endpoint to start the job:

import requests
import json
from google.auth import default
from google.auth.transport.requests import Request

# Authentication: get Application Default Credentials and refresh an access token.
creds, project_id = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
creds.refresh(Request())

# Prepare the request
url = f"https://us-central1-aiplatform.googleapis.com/v1/projects/{project_id}/locations/us-central1/batchPredictionJobs"
headers = {
    "Authorization": f"Bearer {creds.token}",
    "Content-Type": "application/json; charset=utf-8",
}

data = json.dumps(request_body)

print(data)

# Make the request
response = requests.post(url, headers=headers, data=data)

# Handle response
if response.status_code == 200:
    print("Batch prediction job created successfully!")
    print(response.json())  # Print the response (optional)
else:
    print(f"Error creating batch prediction job: {response.status_code} {response.text}")

This job will be listed in the Vertex AI Batch Prediction view:

When it is done, we can check its elapsed time, which represents the total time Vertex AI needed to provision the compute necessary to coordinate the job and execute a series of calls to Gemini, respecting the quotas set for our Google Cloud project. We can see that to process this batch consisting of just a single record, Vertex AI needed 2 minutes and 18 seconds. In this case, the majority of this time was spent waiting for compute resources for the job.
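
If you prefer to track the job from the notebook instead of the console, you can poll the job resource returned by the create call until it reaches a terminal state. A minimal sketch, reusing the headers from above (for long-running jobs the access token may need to be refreshed):

import time

# Resource name returned by the create call, e.g.
# "projects/.../locations/us-central1/batchPredictionJobs/1234567890"
job_name = response.json()["name"]

while True:
    poll = requests.get(
        f"https://us-central1-aiplatform.googleapis.com/v1/{job_name}",
        headers=headers,
    )
    state = poll.json().get("state")
    print(state)
    if state in ("JOB_STATE_SUCCEEDED", "JOB_STATE_FAILED", "JOB_STATE_CANCELLED"):
        break
    time.sleep(30)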

When you go to BigQuery, you will see the new tables created by the job according to the configuration, containing the generated responses and the corresponding requests:

If you double-click on the response object, you will find both the generated responses and the safety filter scores.
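
To pull the generated text out of the response column programmatically, we can query the output table with the BigQuery client we used earlier. A minimal sketch, assuming the output table from our configuration and the standard Gemini response JSON schema:

# Extract the first candidate's text from each response.
sql = """
SELECT
  JSON_VALUE(response, '$.candidates[0].content.parts[0].text') AS generated_text
FROM `genai-app-builder.genaibatch.batchoutput`
LIMIT 5
"""
for row in client.query(sql).result():
    print(row.generated_text)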

My current quota for Gemini 1.5 Flash for online generation is up to 200 requests per minute. The requests per minute (RPM) quota applies to a base model and all versions, identifiers, and tuned versions of that model, so the quota for gemini-1.5-flash-001 is the same as the quota for the corresponding base model gemini-1.5-flash. I can find the quotas for my model and GCP region using the following filters in the Google Cloud console under Quotas and System Limits:

Filters:

aiplatform.googleapis.com/generate_content_requests_per_minute_per_project_per_base_model

Dimensions (e.g. location): base_model:gemini-1.5-flash

Dimensions (e.g. location): region:us-central1

You can also request an increase of this quota:

For the given quota of 200 RPM, the theoretical number of records processed in one hour is 12,000 requests (60 minutes * 200 RPM).

There is a similar quota for batch generation that is not yet visible in the console. A temporary workaround until it becomes available is to open a tech support ticket.

I executed the BATCH API job with 1, 1,000, and 10,000 records, and here are the corresponding job elapsed times:

  • 1 record: 2 min 18 sec = 0.5 RPM
  • 1,000 records: 6 min 49 sec = 142 RPM
  • 10,000 records: 48 min = 10,000/48 min = 208 RPM

One important conclusion is that the BATCH API is able to adapt to our quotas, and job execution times are quite close to the theoretical limits for a given RPM. This simple test also shows that it makes more sense to use the BATCH API with larger record counts. With small record counts, the overhead of provisioning compute resources largely dominates the total execution time.

There is also an economic advantage to using the BATCH API. By opting for batch processing instead of real-time interactions, you can achieve a 50% cost reduction compared to online mode. This makes it a compelling choice for scenarios where generating large amounts of content efficiently is paramount.

BigQuery ML.GENERATE_TEXT

BigQuery is an amazing technology for efficiently processing any volume of data. With native support for multimodal Gemini, it gained superpowers to also process unstructured data like images, text, video, and audio.

Calling Gemini models from BigQuery happens through the SQL function ML.GENERATE_TEXT.

In this mode we also need our data in a BigQuery table; however, ML.GENERATE_TEXT is quite flexible about table structure. It only requires that the SQL query in which we use ML.GENERATE_TEXT contains either a column named prompt or an expression aliased as prompt. Values here are plain text, and there is no requirement to structure them as JSON compliant with the Gemini request JSON schema.

ML.GENERATE_TEXT expects us to have a BigQuery object of type MODEL, which represents a Gemini model available through Vertex AI.

Here are the steps to create it. It is a one-time operation; we will then be able to use this model object in many queries, and our BigQuery admins will be able to control which users can use it.

The first thing to create is a CONNECTION. A connection in BigQuery acts as a bridge between BigQuery and external systems like Vertex AI.

It’s a crucial component that simplifies data access and management. Most importantly, it gives you a service account (technical user) that will represent the BigQuery MODEL object when it calls the actual model hosted on Vertex AI.

Click the [Add] button in the BigQuery Explorer view:

Select [Connections to external data source]

In the next step you need to select [Vertex AI remote models, remote functions and BigLake] and provide a friendly name for this CONNECTION object:

You will see all your connections listed in BigQuery, and you can also check the service account created to represent this connection:

You will need to assign this service account the Vertex AI User role in Google Cloud IAM so that it can call Vertex AI models.

You are now ready to create a BigQuery MODEL object to represent Gemini 1.5 Flash. The generative AI model linked to a BigQuery model object is determined by its ENDPOINT attribute. The connection we just created needs to be specified after the CONNECTION keyword:

CREATE OR REPLACE MODEL
`genai-app-builder.speech2textdataset.vertexaigemini15flash`
REMOTE WITH CONNECTION `genai-app-builder.us-central1.speech2text`
OPTIONS (ENDPOINT = 'gemini-1.5-flash');

When calling Gemini from the ML.GENERATE_TEXT function, we need to provide three things:

  • which BigQuery model object (so which LLM model) you want to use
  • which table you want to process (it can also be SQL query)
  • prompt

In our tests we will reuse the BigQuery table we created for the BATCH API experiments and parse its request column to extract just the prompt text:

CREATE TABLE `genai-app-builder.genaibatch.batchoutput1000bq`
AS SELECT * FROM ML.GENERATE_TEXT(
  MODEL `genai-app-builder.speech2textdataset.vertexaigemini15flash`,
  (
    SELECT
      JSON_VALUE(request.contents[0].parts["text"]) AS prompt
    FROM `genai-app-builder.genaibatch.batchinput1`
  ),
  STRUCT(
    0.2 AS temperature, 1024 AS max_output_tokens, 0.2 AS top_p,
    15 AS top_k, TRUE AS flatten_json_output
  )
);
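
The same query can also be submitted from the notebook with the BigQuery client used earlier. With flatten_json_output set to TRUE, the generated text lands in the ml_generate_text_llm_result column, so previewing the output might look roughly like this:

# Preview a few generated responses from the table created above.
sql = """
SELECT prompt, ml_generate_text_llm_result
FROM `genai-app-builder.genaibatch.batchoutput1000bq`
LIMIT 5
"""
for row in client.query(sql).result():
    print(row.prompt, "->", row.ml_generate_text_llm_result)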

I executed this query with 1, 1,000, and 10,000 records:

  • 1 record: 4 seconds
  • 1,000 records: 4 minutes 30 seconds
  • 10,000 records: 58 min = 10,000/58 min = 172 RPM

One observation is that for really small batches, the overhead of provisioning compute and coordinating calls to Gemini is much smaller than it is for the BATCH API. For larger batches, performance is comparable and close to the theoretical limits for the given RPM quotas. BigQuery ML.GENERATE_TEXT has the same economic advantage as the BATCH API: the corresponding cost of calling Gemini is 50% of what you would pay for online calls. Please note, however, that BigQuery has a default maximum query execution time of 6 hours, meaning that with an RPM of 200 the job will be able to process up to 6 * 60 minutes * 200 RPM = 72,000 records. The BATCH API does not have such a constraint.

Summary:

Moving Generative AI demos to production is challenging due to varying requirements. This article explores Google Cloud options for high-throughput scenarios, such as creating personalized product descriptions or marketing materials at scale.

The article focuses on two key options: Vertex AI BATCH API and BigQuery ML.GENERATE_TEXT. Vertex AI BATCH API uses BigQuery tables as input, requiring a column named “request” with JSON-formatted requests adhering to the Gemini request schema. The article provides a detailed example using the gemini-1.5-flash-001 model and the Python BigQuery client SDK.

BigQuery ML.GENERATE_TEXT is a native SQL function in BigQuery, simplifying Gemini model calls. It requires a BigQuery Model object representing the Gemini model and offers flexibility in table structure. The article demonstrates using ML.GENERATE_TEXT with Gemini 1.5 Flash, emphasizing its flexibility for small and large batches.

Both options offer cost benefits compared to real-time interactions (a 50% discount), with the BATCH API being more suitable for very large datasets due to BigQuery’s default query execution time limit of 6 hours.

Please clap for this article if you enjoyed reading it. For more about Google Cloud, data science, data engineering, and AI/ML, follow me on LinkedIn.

This article is authored by Lukasz Olejniczak — AI Specialist at Google Cloud. The views expressed are those of the author and don’t necessarily reflect those of Google.
