Vertex AI Batch Generation
Gemini with a 50% cost reduction
Gemini allows you to submit multiple multimodal requests as a Vertex AI Batch job. Those run as asynchronous jobs and are great for workloads that do not require an immediate response.
Today, we will explore batch generation, which can process large datasets more efficiently and at a fixed 50% cost reduction.
In a previous article, we already explored another 75% cost-saving method called context caching. If you want to re-watch it, here is the video.
What Is Batch Generation?
Batch generation in Gemini enables the parallel processing of multiple Gemini requests. Instead of sending requests individually, you can bundle them into a single batch.
If you are working with models like Gemini, consider whether your use case requires real-time generation. If not, batch predictions might be an excellent alternative.
There are plenty of use cases where batch generation makes sense.
I used it a couple of weeks ago to transcribe thousands of YouTube Shorts. With another customer, we processed images and metadata to create detailed product descriptions for more than 1,000,000 products.
Enough talking, I guess you got the idea.
It all starts with your data.
Vertex AI Batch Text Generation requires input data stored in Cloud Storage or BigQuery. We focus on Cloud Storage for now, but later in the article I will also show an example of how to do the same with data stored in BigQuery.
In the case of Cloud Storage, the data is stored in JSONL format.
JSONL (JSON Lines) is a format where each line is a separate JSON object. This is a hard requirement. The file needs to look like this.
{"request": {"contents": [{"role": "user", "parts": [{"text": "Write a recipe for a breakfast"}]}]}}
{"request": {"contents": [{"role": "user", "parts": [{"text": "Write a recipe for a american BBQ"}]}]}}
Gemini’s ability to seamlessly handle different modalities, such as text, documents, images, videos, and audio, makes it an extremely flexible solution for many use cases. In the following, I will show you a few examples of how your batch generation data (JSONL) should be structured for different modalities. Keep in mind that you can also combine modalities.
Here are a few examples. They all follow the standard Gemini request body.
The parts array allows you to include multiple input components in a single Gemini request. Each part can be any of the modalities, so you can send files (e.g., a document, image, video, or audio) together with text to the model. The model then processes all inputs together.
{
  "id": "invoice-12345",
  "request": {
    "contents": [
      {
        "role": "USER",
        "parts": [
          {
            "fileData": {
              "fileUri": "gs://doit-llm/multimodal/invoice.pdf",
              "mimeType": "application/pdf"
            }
          },
          {
            "text": "Extract the following entities: Invoice Number, Total Amount, and Due Date."
          }
        ]
      }
    ]
  }
}
When using this data with batch generation, keep in mind that each request must be a single flat JSONL line.
{"id": "invoice-12345", "request": {"contents": [{"role": "USER", "parts": [{"fileData": {"fileUri": "gs://doit-llm/multimodal/invoice.pdf", "mimeType": "application/pdf"}}, {"text": "Extract the following entities: Invoice Number, Total Amount, and Due Date."}]}]}}
Usage
Before we can run the batch generation job, we need a Google Cloud Storage Bucket. The job uses this bucket to read input and store the output.
You can create the bucket via the Google Cloud Console or gsutil.
gsutil mb gs://doit-llm
# you need to choose a different bucket name, as bucket names are globally unique
After that, upload the data you prepared in the previous step into that bucket.
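For example, with gsutil (assuming the bucket and input path used throughout this article):
gsutil cp data.jsonl gs://doit-llm/batch/input/data.jsonl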
Now that your input files are ready and uploaded, it’s time to submit the batch prediction job using the Vertex AI SDK.
import vertexai
from vertexai.preview.batch_prediction import BatchPredictionJob

# Initialize the SDK with your own project ID and region (placeholders below)
vertexai.init(project="your-project-id", location="us-central1")

input_uri = "gs://doit-llm/batch/input/data.jsonl"
output_uri = "gs://doit-llm/batch/output/"

# Submit the batch prediction job against Gemini 1.5 Flash
batch_prediction_job = BatchPredictionJob.submit(
    source_model="gemini-1.5-flash-002",
    input_dataset=input_uri,
    output_uri_prefix=output_uri,
)
Where to find the batch generation job?
The Batch Text Generation job is listed as a batch job within Vertex AI: https://console.cloud.google.com/vertex-ai/batch-predictions. Unfortunately, there is no progress indicator, but we can check whether the job is still queued or already being processed.
The SDK will also automatically print a link to the Vertex AI Batch Job.
If something fails, this is a good starting point. From here, you can open up the Logs. In most cases, the cause is input data in the wrong format.
I like to poll the status of the job until it is done:
import time

# Monitor the job until it reaches a terminal state
while not batch_prediction_job.has_ended:
    print(f"Job state: {batch_prediction_job.state.name}")
    time.sleep(5)
    batch_prediction_job.refresh()

if batch_prediction_job.has_succeeded:
    print(f"Job succeeded! Output located at: {batch_prediction_job.output_location}")
else:
    print(f"Job failed with error: {batch_prediction_job.error}")
The batch generations are stored back in your output location and have the following structure. The response object is the standard Gemini API response object.
{
  "id": "",
  "status": "",
  "processed_time": "",
  "request": {},
  "response": {}
}
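To work with the results, download and parse the JSONL files from the output location. Here is a minimal sketch using the google-cloud-storage client; the bucket and prefix are the ones from the example above, and the exact file names inside the output folder depend on the job, so check batch_prediction_job.output_location for the precise path:
import json
from google.cloud import storage

# Read every JSONL result file under the output prefix (assumed paths)
client = storage.Client()
bucket = client.bucket("doit-llm")

results = []
for blob in bucket.list_blobs(prefix="batch/output/"):
    if not blob.name.endswith(".jsonl"):
        continue
    for line in blob.download_as_text().splitlines():
        results.append(json.loads(line))

print(f"Loaded {len(results)} batch generations")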
Identify your batch generations
Batch generations usually need to be mapped back to the requests they were generated for. You can add an ID to your batch prediction data to do so.
{"id": "invoice-1", "request": {...}}
{"id": "invoice-2", "request": {...}}
{"id": "invoice-3", "request": {...}}
What about tools and configurations?
You can also use tools and configurations with Vertex AI Batch Generation. This allows you to have the same level of control and feature set as if you were using Gemini without Batch Generation.
Tools
Tools can also be used the same way as before. For example, you can ground responses with Google Search.
{
  "id": 1,
  "request": {
    "contents": [
      {
        "parts": [
          {
            "text": "What's the Apple stock price?"
          }
        ],
        "role": "user"
      }
    ],
    "tools": [
      {
        "googleSearchRetrieval": {
          "dynamicRetrievalConfig": {
            "mode": "MODE_DYNAMIC",
            "dynamicThreshold": 0.7
          }
        }
      }
    ]
  }
}
Configurations
With configurations, you can set generation parameters such as the temperature.
{
  "id": 1,
  "request": {
    "contents": [
      {
        "parts": [
          {
            "text": "Give me a recipe for an Italian breakfast."
          }
        ],
        "role": "user"
      }
    ],
    "generationConfig": {
      "temperature": 1
    }
  }
}
The full code for this article is available on GitHub
Conclusion
Batch generation with Gemini in Vertex AI is a powerful tool for processing large datasets efficiently and at a lower cost. With the ability to handle multimodal data and batch multiple requests together, you can save time and resources. Here are the key takeaways:
- 50% Cost Reduction: Batch generation costs 50% less than standard online predictions, which adds up quickly for large workloads.
- Multimodal Flexibility: Process text, images, audio, and video in a single request, all within one batch.
- Data Source Options: Supports both Cloud Storage and BigQuery, with input in the required JSONL format for Cloud Storage.
- Easy Setup: Simple instructions to set up batch prediction using Vertex AI SDK, Cloud Storage, and BigQuery.
- Customizable Generation: Adjust parameters like temperature and other settings to fine-tune your batch predictions.
Thanks for reading and listening
I appreciate your feedback and hope you enjoyed the article.
You can find me on LinkedIn. Even better, subscribe to my YouTube channel ❤️.