Building a Vertex AI RAG Engine Pipeline with the Gemini 2.0 Flash LLM
Retrieval-Augmented Generation (RAG) is rapidly becoming the backbone of production-grade AI systems that ground language model responses in facts. Google recently announced general availability (GA) of Vertex AI's RAG Engine, a fully managed service that lets you build and deploy RAG implementations using your own data and methods.
In this article, I'll walk you through a complete RAG workflow using Vertex AI and Gemini 2.0 Flash, from corpus creation to document import, query retrieval, content generation, and final cleanup.
What is Vertex AI RAG Engine?
Vertex AI RAG Engine is a component of the Google Cloud Vertex AI Platform that facilitates Retrieval-Augmented Generation (RAG).
- Fully managed Vector Store
- Seamless integration with Gemini models
- Supports real-time retrieval and grounding
- Enhances LLM responses with relevant, retrieved context
- Ideal for building intelligent, data-aware AI applications
Overview of the Retrieval-Augmented Generation (RAG) process in Vertex AI RAG Engine
1. Data Ingestion: Collect data from various sources like local files, Cloud Storage, or Google Drive.
2. Data Transformation: Prepare the collected data for indexing by processing and organizing it appropriately.
3. Embedding: Convert pieces of text into numerical representations (embeddings) that capture their meaning and context.
4. Data Indexing: Create an organized index (corpus) of these embeddings to facilitate efficient search and retrieval.
5. Retrieval: When a user submits a query, search the indexed corpus to find relevant information.
6. Generation: Use the retrieved information as context to generate accurate and relevant responses to the user’s query.
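To make steps 3 through 6 concrete before we touch the managed service, here is a toy, framework-free sketch of the retrieve-and-ground loop (the embed function is a crude stand-in for a real embedding model, and the two documents are made up for illustration):

# Toy RAG loop; real systems use learned embeddings and a vector database.
docs = [
    "RAG grounds LLM answers in retrieved documents.",
    "Gemini 2.0 Flash is a fast multimodal model.",
]

def embed(text):
    # Stand-in for a real embedding model: count a few keywords
    return [text.lower().count(w) for w in ("rag", "llm", "gemini", "model")]

def score(a, b):
    # Dot product as a crude similarity measure
    return sum(x * y for x, y in zip(a, b))

query = "What does RAG do for an LLM?"
q_vec = embed(query)
best = max(docs, key=lambda d: score(embed(d), q_vec))  # retrieval
prompt = f"Context: {best}\n\nQuestion: {query}"        # grounding
print(prompt)  # in a real pipeline, this prompt goes to the LLM for generation

Vertex AI RAG Engine handles all of these steps for you; the rest of this article wires them up with the managed APIs.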
Key features of Vertex AI RAG Engine
- DIY RAG: Easy, flexible setup for low to medium complexity use cases
- Vertex AI Search: Fully managed, high-quality search with minimal maintenance
- Connectors: Quick integration with sources like GCS, Drive, Jira, Slack
- Scalable Performance: Fast, low-latency search for large data volumes
- Better LLM Output: Accurate, relevant responses from enhanced retrieval
What is a Corpus?
A corpus, also referred to as an index, is a collection of documents or sources of information.
In Vertex AI RAG:
- A corpus is a set of ingested documents (e.g., PDFs, websites, files)
- It acts as a knowledge base (KB) for RAG
- It is managed by the service and used to ground contextual responses from LLMs
Let's Build a RAG Pipeline with Gemini 2.0 Flash
We’ll break down the steps from the Vertex AI RAG quickstart for Python.
- Create and manage a RAG corpus in Vertex AI
- Import documents from Google Cloud Storage
- Configure chunking and embedding models
- Perform direct context retrieval
- Generate AI responses enhanced with document context using Gemini models
- Clean up RAG corpora after the demo
Prerequisites
- Google Cloud account with Vertex AI access
- Python 3.8+
Step 1: Setting Up the Environment
- Install Python 3.8 or later.
- Install the dependencies:
pip install -r requirements.txt
pip install --upgrade google-cloud-aiplatform
Then import the required modules in your Python script:
from vertexai import rag
from vertexai.generative_models import GenerativeModel, Tool
import vertexai
Step 2: Enable Google Cloud APIs (Vertex AI and Cloud Storage)
After creating your GCP project, authenticate with gcloud auth login (plus gcloud auth application-default login so the Python SDK can pick up credentials), then enable the required APIs, replacing PROJECT_ID with your own project ID:
gcloud services enable aiplatform.googleapis.com --project=PROJECT_ID
gcloud services enable storage.googleapis.com --project=PROJECT_ID
Set up IAM permissions:
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:YOUR_EMAIL@domain.com" --role="roles/aiplatform.user"
gcloud projects add-iam-policy-binding PROJECT_ID --member="user:YOUR_EMAIL@domain.com" --role="roles/storage.objectAdmin"
Create a GCS bucket and upload the PDF:
# Create a new GCS bucket (skip if you already have one)
gsutil mb -l us-central1 gs://your-bucket-name
# Upload PDF files to the bucket
gsutil cp your-document.pdf gs://your-bucket-name/
# Verify files were uploaded successfully
gsutil ls gs://your-bucket-name/
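If you prefer to stay in Python, the google-cloud-storage client can do the same thing. A sketch, assuming pip install google-cloud-storage and a globally unique bucket name:

from google.cloud import storage

client = storage.Client(project="YOUR_PROJECT_ID")
# Create the bucket (skip if it already exists)
bucket = client.create_bucket("your-bucket-name", location="us-central1")
# Upload the PDF and verify
bucket.blob("your-document.pdf").upload_from_filename("your-document.pdf")
print([b.name for b in client.list_blobs("your-bucket-name")])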
Step 3: Initialize Vertex AI
Make sure your project ID, corpus name, and GCS bucket are set up before initializing. In this demo, I used my personal GCP project.
PROJECT_ID = "YOUR_PROJECT_ID" # Your Google Cloud Project ID
display_name = f"{PROJECT_ID}_rag_corpus" # A name for your RAG corpus
paths = ["gs://YOUR_BUCKET_NAME/YOUR_DOCUMENT.pdf"] # List of file paths in GCS
vertexai.init(project=PROJECT_ID, location="us-central1")
This initializes the Vertex AI API with your project ID and sets the Google Cloud region to "us-central1", one of the US regions where RAG Engine is currently available.
Step 4: Create Embedding model + RAG Corpus
Configure the embedding model to use for RAG. Embeddings are numerical representations of text that capture semantic meaning. Here, we use Google's "text-embedding-005" model.
embedding_model_config = rag.RagEmbeddingModelConfig(
vertex_prediction_endpoint=rag.VertexPredictionEndpoint(
publisher_model="publishers/google/models/text-embedding-005"
)
)
Create a vector database configuration for RAG, using the embedding model configuration defined earlier.
backend_config = rag.RagVectorDbConfig(rag_embedding_model_config=embedding_model_config)
This creates a new RAG corpus (a collection of documents) with the specified display name and backend configuration.
rag_corpus = rag.create_corpus(
display_name=display_name,
backend_config=backend_config,
)
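create_corpus returns the corpus resource; its name field holds the full resource path that the later steps reference, so it is worth printing:

# Full resource path, e.g. projects/PROJECT_NUMBER/locations/us-central1/ragCorpora/CORPUS_ID
print(rag_corpus.name)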
Step 5: Import files into RAG Corpus
We define how documents will be processed when imported: documents are split into chunks of 512 tokens, with 100 tokens of overlap between consecutive chunks.
paths = ["gs://vector-ai-rags/2312.10997v5.pdf"] # Replace with your actual path
import_response = rag.import_files(
rag_corpus.name,
paths,
transformation_config=rag.TransformationConfig(
chunking_config=rag.ChunkingConfig(chunk_size=512, chunk_overlap=100)
)
)
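import_files blocks until ingestion completes. Per Google's quickstart sample, the returned response exposes an imported_rag_files_count field, which makes for a quick sanity check:

print(f"Imported {import_response.imported_rag_files_count} files.")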
Step 6: List All Files in the RAG Corpus
First, verify the import by listing the files in the corpus:
corpus_name = rag_corpus.name
rag.list_files(corpus_name)
Next, define the configuration for retrieving information from the corpus. It specifies that the top 3 most relevant chunks will be retrieved, and only chunks whose vector distance falls below the 0.5 threshold will be considered.
rag_retrieval_config = rag.RagRetrievalConfig(
    top_k=3,
    filter=rag.Filter(vector_distance_threshold=0.5)
)
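Looping over the listed files is a quick way to eyeball what was ingested (a sketch, assuming each RagFile entry exposes a display_name, as in the SDK samples):

files = rag.list_files(corpus_name)
for f in files:
    print(f.display_name)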
Step 7: Retrieve Context
This performs a direct retrieval query against the corpus, asking “What is RAG and why it is helpful?”. It will return the most relevant chunks from the corpus based on the retrieval configuration.
response = rag.retrieval_query(
rag_resources=[
rag.RagResource(
rag_corpus=corpus_name,
)
],
text="What is RAG and why it is helpful?",
rag_retrieval_config=rag_retrieval_config,
)
print(response)
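Beyond printing the whole response, you can pull out the individual chunks. A sketch, assuming the response follows the RetrieveContextsResponse shape with text and source_uri fields on each context:

for ctx in response.contexts.contexts:
    print(ctx.source_uri)  # which GCS file the chunk came from
    print(ctx.text[:200])  # first 200 characters of the chunk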
Step 8: Generate Gemini LLM response using RAG retrieval tool
This creates a RAG retrieval tool that can be used with a generative model. It specifies the corpus to retrieve from and the retrieval configuration.
rag_model = GenerativeModel(
model_name="gemini-2.0-flash-001",
tools=[Tool.from_retrieval(
retrieval=rag.Retrieval(
source=rag.VertexRagStore(
rag_resources=[rag.RagResource(rag_corpus=rag_corpus.name)],
rag_retrieval_config=rag.RagRetrievalConfig(top_k=3)
)
)
)]
)
query = "What is RAG and why it is helpful?"
response = rag_model.generate_content(query)
print(response.text)
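To see what grounding buys you, compare this answer with the same model running without the retrieval tool; a minimal sketch:

# Same model, no retrieval tool: it answers from training data alone
plain_model = GenerativeModel(model_name="gemini-2.0-flash-001")
plain_response = plain_model.generate_content(query)
print(plain_response.text)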
Step 9: Demo
Step 10: Clean Up
Delete the RAG corpus, then clean up the remaining GCP resources:
rag.delete_corpus(corpus_name)
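delete_corpus removes the corpus and its index, but not the GCS bucket. If you created the bucket only for this demo, remove it too; a sketch using the google-cloud-storage client (assumes the bucket holds nothing you want to keep):

from google.cloud import storage

client = storage.Client(project=PROJECT_ID)
# force=True also deletes the objects inside the bucket
client.get_bucket("your-bucket-name").delete(force=True)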
GitHub Repository
You can find the complete source code on GitHub in the Vertex-AI-RAG repository.
Conclusion
Vertex AI RAG Engine provides a solid foundation for building advanced, production-grade GenAI applications on Google Cloud. With a fully managed vector DB and tight integration with Gemini, there’s no need to rely on third-party tooling.
In a future post, I’ll explore RAG + Cloud Functions and multi-agent orchestration using Vertex AI Search.