Unlock Hidden Gems: 4 Vertex AI & Gemini Secrets for Production GenAI Success
Ready to take your ML projects to production, but feeling a bit overwhelmed by the complexities? Vertex AI has your back. As Google Cloud’s fully-managed and unified AI development platform, Vertex AI is brimming with underutilized features specifically designed to smooth the path from development to deployment.
In this post, we’ll shine a light on four lesser-known aspects of Vertex AI that have the potential to transform your production workflows.
Larger short-term memory: the 2-million-token context window
Generative models have traditionally been limited to processing short contexts of 8k tokens or less. While recent advancements have pushed these boundaries to 32k or even 128k tokens, developers still face challenges in managing and optimizing these larger context windows in production environments, often resorting to techniques like chunking or RAG.
Gemini 1.5 Pro breaks boundaries, handling an astounding 2 million tokens — many multiples larger than other leading models! To put that in perspective, imagine processing:
- 50,000 lines of code (with the standard 80 characters per line)
- All the text messages you have sent in the last 5 years
- 8 average-length English novels
- Transcripts of over 200 average-length podcast episodes
This massive context window empowers developers to build AI applications that possess a far larger understanding of the information they process, often eliminating the need for techniques like chunking or relying on external knowledge bases (RAG).
How can this help your use case? Let's say you want to build a legal research tool. You could provide the entire text of a complex legal case as context, allowing the model to answer nuanced questions about the case without referencing external documents or databases. This is exactly where the 2-million-token context window shines.
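To make that concrete, here is a minimal sketch of passing a full case file to Gemini 1.5 Pro with the Vertex AI Python SDK. The bucket path, file name, and question are placeholders for illustration; the point is simply that the whole document fits in a single request.

import vertexai
from vertexai.generative_models import GenerativeModel, Part

# TODO(developer): replace with your own project and document
vertexai.init(project="PROJECT_ID", location="us-central1")

model = GenerativeModel("gemini-1.5-pro-001")

# A hypothetical PDF containing the full text of a legal case
case_file = Part.from_uri(
    "gs://YOUR_BUCKET/legal/case_file.pdf",
    mime_type="application/pdf",
)

response = model.generate_content([
    case_file,
    "Summarize the court's reasoning on liability and cite the relevant sections.",
])
print(response.text)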
Reduce the cost of requests: Context caching
Longer contexts can indeed raise costs, but Vertex AI’s context caching offers a solution. Context caching allows you to reuse previously processed content, reducing the number of tokens the model needs to process for each new request.
Let’s say you have a GenAI application that helps students learn about foundation models. You want to provide some of the latest research papers as context in PDF format. This is where context caching comes into play. First, you will need to create a context cache:
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part
from vertexai.preview import caching

# TODO(developer): Update and un-comment below line
# project_id = "PROJECT_ID"
vertexai.init(project=project_id, location="us-central1")

system_instruction = """
You are an AI research assistant teaching students. You always stick to the facts in the sources provided.
Now look at these research papers, and answer the following question:
"""

contents = [
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2312.11805v3.pdf",
        mime_type="application/pdf",
    ),
    Part.from_uri(
        "gs://cloud-samples-data/generative-ai/pdf/2403.05530.pdf",
        mime_type="application/pdf",
    ),
]

cached_content = caching.CachedContent.create(
    model_name="gemini-1.5-pro-001",
    system_instruction=system_instruction,
    contents=contents,
)
Once you create the context cache object, you can use REST APIs or the Python SDK to reference content stored in a context cache in your generative AI application.
model = GenerativeModel.from_cached_content(cached_content=cached_content)
response = model.generate_content("Can you explain more about the Gemini model architecture?")
print(response.text)
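Once the cache exists, follow-up questions can keep reusing the same model object without resending the PDFs, so each new request only adds its own tokens on top of the cached content. A short sketch of a follow-up question and cleanup; the question itself is just an illustrative example.

# Ask additional questions against the same cached papers
follow_up = model.generate_content(
    "How do the two papers differ in their evaluation setup?"
)
print(follow_up.text)

# Remove the cache when you no longer need it
cached_content.delete()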
Decrease latency: Async prompting for Vertex AI Gemini
Recently I advised a startup building a solution that uses Gemini. As part of this solution, the team had to make several calls to the API to get the output it needed. Asking the model several questions sequentially is time-consuming and increases the overall latency of your application. To solve this problem, you can prompt Gemini asynchronously.
Generating responses from language models can be time-consuming, especially across multiple requests, but you can leverage asynchronous programming in Python using asyncio and tenacity with the Vertex AI SDK. This lets you send multiple prompts in parallel, drastically reducing the overall wait time.
The code snippet below defines an asynchronous function async_generate that utilizes the Vertex AI GenerativeModel to generate content based on a given prompt. It employs a retry mechanism to handle potential failures and ensures asynchronous execution for improved efficiency when dealing with multiple requests.
import asyncio
from tenacity import retry, wait_random_exponential
import vertexai
from vertexai.generative_models import GenerativeModel


@retry(wait=wait_random_exponential(multiplier=1, max=120))
async def async_generate(prompt):
    model = GenerativeModel("gemini-1.5-pro-001")
    response = await model.generate_content_async(
        [prompt],
        stream=False,
    )
    return response.text
Next, you can use the async_generate function to ask Gemini for a matching sport for each item, with all requests running in parallel.
sports_items = ["Puck", "Racket", "Gi", "Goggles"]
get_responses = [async_generate("Give me a sport for the following item: " + item) for item in sports_items]
responses = await asyncio.gather(*get_responses)
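The top-level await above works in a notebook, where an event loop is already running. In a plain Python script you would wrap the calls in a coroutine and hand it to asyncio.run, for example:

async def main():
    sports_items = ["Puck", "Racket", "Gi", "Goggles"]
    tasks = [async_generate("Give me a sport for the following item: " + item) for item in sports_items]
    return await asyncio.gather(*tasks)


responses = asyncio.run(main())
print(responses)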
Want to learn more? Have a look at this Medium blog.
One-click deployment: Quickly deploy world class foundation models
Remember the days when deploying an open-source model was a series of ML engineering acrobatics? The endless API configurations and custom input functions? Those days are fading. Vertex AI simplifies this process, allowing you to quickly deploy open models like Gemma 2 or LLaMA and use them as a managed API through Models as a Service, saving you time and effort.
One foundation model capturing the hearts and minds of the open-source community is Gemma 2, and for good reason. You can now deploy Gemma 2 on Vertex AI or GKE with remarkable ease, which makes integrating it into your own use cases much more straightforward.
When would you use this instead of Models as a Service? Imagine you’re building a real-time language translation app for a multinational conference in Singapore. To minimize latency and ensure a seamless experience for your users, you’d want to deploy your Gemma 2-powered translation model in the asia-southeast1 region, which is geographically closest to your users. With Vertex AI, you have the flexibility to deploy your models where they’re needed most, ensuring optimal performance and responsiveness.
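Once such an endpoint is deployed, calling it looks like any other Vertex AI endpoint. Below is a minimal sketch using the google-cloud-aiplatform SDK; the endpoint ID is a placeholder, and the request payload is an assumption that depends on the serving container you picked when deploying the model.

from google.cloud import aiplatform

# TODO(developer): replace with your project and the deployed endpoint ID
aiplatform.init(project="PROJECT_ID", location="asia-southeast1")
endpoint = aiplatform.Endpoint("ENDPOINT_ID")

# The instance schema below assumes a text-generation serving container;
# check your deployment's documentation for the exact fields.
response = endpoint.predict(
    instances=[{"prompt": "Translate to Malay: Welcome to the conference.", "max_tokens": 128}],
)
print(response.predictions)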
What’s next?
We’ve seen how Vertex AI can make your life with generative AI better with larger context windows, context caching, asynchronous execution, and easy deployment. Ready to dive deeper into the world of Generative AI with Vertex AI? Join the Cloud community on Medium for the latest technical blog posts.
Explore our Generative AI repository on GitHub, packed with code samples and tutorials, including this introductory notebook on long context windows. I’m excited to see what you build with Vertex AI and Generative AI!
Feel free to leave a comment below if you have any questions or want to share your projects.