
RAG API

30 lines of code is all you need for RAG. The easiest way to get started with RAG.

7 min read · Sep 13, 2024

The Google Cloud RAG API is a lesser-known offering that provides a fast and easy entry into RAG.

In this article, we use Google’s RAG API to retrieve documents similar to our query and combine them with Gemini to answer it. 30 lines of code is all you need to get your RAG up and running.

This article is part #7 of my Friday livestream series. You can watch all the previous recordings. Join me every Friday from 10–11:30 AM CET / 8–10:30 UTC.

Without RAG vs with RAG

A quick introduction to RAG for people who are new to this domain.

Without RAG, the LLM generates a response based solely on its pre-trained knowledge. This might work for generic questions, but it fails, for example, for internal questions about your company or for documents the model has never seen before.

Let's assume we want to ask an LLM about our company's vacation policy. Since the model does not have this information, we need to retrieve it using RAG. Without retrieval, the model in the best case cannot answer the question, and in the worst case it hallucinates and provides a wrong answer.

How does the RAG API work?

In the background, invisible to us, the RAG (Retrieval-Augmented Generation) API does the following:

  1. It creates a vector index.
  2. It creates embeddings for the documents we upload.
  3. It uses the index to retrieve relevant documents for a query.
  4. Finally, it generates an answer by augmenting the LLM with the retrieved documents.

Google’s RAG API builds on LlamaIndex as a managed service. Because it is managed, it is easy to set up and use.

Documents

I used Gemini to create three imaginary product manuals for our use case. We use them to demonstrate RAG:

  • Nimbus Weather Station
  • PureBrew Automatic Coffee Maker
  • AuraGlow Smart LED Lighting System

Dependencies

The RAG API is part of the Vertex AI SDK. For Python, you can install it by running pip install vertexai.

This allows you to import the classes like this:

from vertexai.preview import rag
from vertexai.generative_models import GenerativeModel, Tool
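
The SDK also needs to know which project and region to talk to. A minimal setup sketch, assuming a placeholder project ID and us-central1 as the region:

import vertexai

# Placeholder; replace with your own Google Cloud project ID.
PROJECT_ID = "your-project-id"

# Initialize the SDK once before creating corpora or models.
vertexai.init(project=PROJECT_ID, location="us-central1")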

Corpus

With Google's RAG API, everything is centered around a corpus, a collection of documents. First, we need to create this corpus, which is empty by default.

corpus = rag.create_corpus(
    display_name="product manuals",
    description="contains all product manuals",
)
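
The returned corpus object carries the full resource name, which the snippets below reference as corpus_name:

# The resource name looks roughly like
# projects/<project>/locations/us-central1/ragCorpora/<corpus-id>.
corpus_name = corpus.name
print(corpus_name)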

When creating the corpus, we can optionally define the embedding model we want to use.

The current default embedding model is text-embedding-004.

embedding_model_config = rag.EmbeddingModelConfig(
    publisher_model="publishers/google/models/text-embedding-004"
)

corpus = rag.create_corpus(
    display_name="product manuals",
    description="contains all product manuals",
    embedding_model_config=embedding_model_config,
)

Additionally, you can use fine-tuned embeddings or open-source embeddings hosted as a Vertex AI endpoint.

PROJECT_ID = "PROJECT_ID"
ENDPOINT_ID = "ENDPOINT_ID"
MODEL_ENDPOINT = f"projects/{PROJECT_ID}/locations/us-central1/endpoints/{ENDPOINT_ID}"

embedding_model_config = rag.EmbeddingModelConfig(
    endpoint=MODEL_ENDPOINT
)

corpus = rag.create_corpus(
    display_name="product manuals",
    description="contains all product manuals",
    embedding_model_config=embedding_model_config,
)

Upload Documents to the Corpus

As a next step, we upload our 3 manuals to the corpus by providing a local file path.

The RAG API chunks your documents, creates the embeddings, and stores them in an index. You can also delete or update documents.

rag_file = rag.upload_file(
    corpus_name=corpus_name,
    path="./documents/auraglow_manual.txt",
)

rag_file = rag.upload_file(
    corpus_name=corpus_name,
    path="./documents/nimbuscloud_manual.txt",
)

rag_file = rag.upload_file(
    corpus_name=corpus_name,
    path="./documents/purebrew_manual.txt",
)
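
As mentioned above, documents can also be removed again. A short sketch using the SDK's list_files and delete_file calls (assuming the current preview API surface):

# List all files in the corpus with their generated resource names.
for f in rag.list_files(corpus_name=corpus_name):
    print(f.display_name, f.name)

# Delete a file by its generated resource name.
rag.delete_file(name=rag_file.name)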

A wide range of document types is supported: Google Docs, Google Drawings, Google Slides, HTML files, JSON files, Markdown files, Microsoft PowerPoint slides (PPTX files), Microsoft Word documents (DOCX files), PDF files, and text files.

You are not limited to local files. You can also import multiple documents directly from Google Cloud Storage and Google Drive. When importing files, you can optionally define the chunk size and overlap. Note that the files are not automatically synced if they are updated on Google Drive or Cloud Storage.

paths = []

paths.append("gs://doit-llm/rag-api/nimbuscloud_manual.txt")
paths.append("gs://doit-llm/rag-api/purebrew_manual.txt")
paths.append("gs://doit-llm/rag-api/auraglow_manual.txt")

rag.import_files(
    corpus_name,
    paths,
    chunk_size=512,
    chunk_overlap=100,
)

Alternatively, you can also provide a Google Cloud Storage folder.

rag.import_files(
    corpus_name=corpus_name,
    paths=["gs://doit-llm/manuals"],
    chunk_size=500,
)

There is even Slack and Jira integration using a data connector.

If you use PDFs, you might gain additional quality by enabling advanced PDF parsing.

response = rag.import_files(
    ...,
    use_advanced_pdf_parsing=True,
)

Retrieving

The retrieval is done by providing our question and the corpus. The RAG API will return the document chunks with the most semantic similarity to our question. You can configure the number of retrieved document chunks and the distance threshold.

corpus = rag.get_corpus(name=corpus_name)
print(corpus)

response = rag.retrieval_query(
    rag_resources=[
        rag.RagResource(
            rag_corpus=corpus_name,
        )
    ],
    text="How do I install the LED light strip?",
    similarity_top_k=10,  # Optional
    vector_distance_threshold=0.5,  # a lower distance means a higher similarity to our query
)
print(response)

Augment and Generate

The final step combines the retrieved document chunks and augments them with our LLM to generate an answer. That sounds complicated, so let me rephrase: we send the retrieved documents along with our question to the LLM, which uses them to answer the question. This process is also called grounding.

rag_retrieval_tool = Tool.from_retrieval(
    retrieval=rag.Retrieval(
        source=rag.VertexRagStore(
            rag_resources=[
                rag.RagResource(
                    rag_corpus=corpus_name,
                )
            ],
            similarity_top_k=3,
            vector_distance_threshold=0.5,
        ),
    )
)

rag_model = GenerativeModel(
    model_name="gemini-1.5-flash-001", tools=[rag_retrieval_tool]
)

response = rag_model.generate_content("How do I install the LED light strip?")
print(response.text)

We use Gemini's built-in tool capabilities to automatically augment the model with the retrieved documents. If you are not using Gemini, you can still take the document chunks from the retrieval step and pass them as context to your LLM.
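
A minimal sketch of that manual route, assuming the retrieval_query response from the Retrieving section exposes the matched chunks under contexts (field names may differ between SDK versions):

# Collect the retrieved chunk texts from the retrieval response.
chunks = [context.text for context in response.contexts.contexts]

# Build a grounding prompt for any LLM of your choice.
prompt = (
    "Answer the question using only the context below.\n\n"
    "Context:\n" + "\n---\n".join(chunks) + "\n\n"
    "Question: How do I install the LED light strip?"
)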

This is the final answer, based on our “AuraGlow Smart LED Lighting System” manual.

1. Unpack the components: LED light strip, control hub, and power adapter. 
2. Mount the light strip: Peel the adhesive backing off the LED strip and attach it to the desired surface.
3. Connect to power: Plug the USB power adapter into the control hub, then into a power outlet.
4. Sync with Wi-Fi: Download the AuraGlow app and connect the Wi-Fi control hub to your home network.

Confluence RAG

Google's RAG API does not officially support Confluence yet. Luckily, we have full control and can load the documents ourselves.

If you want to do RAG with Confluence, check out the implementation in the repository, which includes proper citations.
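
A rough sketch of that route, assuming a hypothetical Confluence Cloud site, a placeholder page ID, and an API token (all values below are illustrative; error handling is omitted):

import requests

# Hypothetical values; replace with your own site, page ID, and credentials.
PAGE_URL = "https://your-site.atlassian.net/wiki/rest/api/content/12345"
AUTH = ("you@example.com", "your-api-token")

# Fetch the page body in storage (HTML) format.
page = requests.get(PAGE_URL, params={"expand": "body.storage"}, auth=AUTH).json()
html = page["body"]["storage"]["value"]

# HTML is a supported document type, so we can upload it like any other file.
with open("confluence_page.html", "w") as f:
    f.write(html)

rag.upload_file(corpus_name=corpus_name, path="confluence_page.html")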

Limits

Updating documents

When uploading documents with upload_file, you cannot provide a unique name; the unique name is generated automatically. That makes updating the corpus based on simple file names impossible and introduces additional implementation effort to handle document updates.
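
One possible workaround is to track your own display names and delete before re-uploading; a sketch, assuming list_files, delete_file, and the display_name parameter behave as sketched earlier:

def replace_file(corpus_name, display_name, path):
    # Remove any existing file carrying the same display name ...
    for f in rag.list_files(corpus_name=corpus_name):
        if f.display_name == display_name:
            rag.delete_file(name=f.name)
    # ... then upload the new version under that display name
    # (display_name as an upload_file parameter is an assumption here).
    return rag.upload_file(
        corpus_name=corpus_name,
        path=path,
        display_name=display_name,
    )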

Uploading multiple files

When uploading files with import_files, you can either provide a Cloud Storage folder or a list of Cloud Storage file paths.

I recommend using a folder, because a list of single file paths is limited to 25 files, while a folder has no such limitation.

(Thanks to the livestream audience for pointing that out.)

Error occurred after 0 files: ('Failed in importing the RagFiles due to: ', 
InvalidArgument('GCS URIs cannot be specified more than 25 times.'))
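
If you need explicit file paths anyway, splitting the list into batches of 25 sidesteps the limit; a minimal sketch:

# Import at most 25 Cloud Storage URIs per call.
BATCH_SIZE = 25
for i in range(0, len(paths), BATCH_SIZE):
    rag.import_files(
        corpus_name,
        paths[i : i + BATCH_SIZE],
        chunk_size=512,
        chunk_overlap=100,
    )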

Ingestion limit?

According to the Google documentation, ingestion supports up to 1000 files. I could not confirm that limit; if you run into it, let me know, as I would love to reproduce it.

Pricing

When using the Google Cloud RAG API, you are billed for different services.

  1. The Google Embedding API, when uploading documents.
  2. The Gemini model, which is billed by the number of characters when using grounding / tools with Gemini.

I still need to clarify the pricing for the index and retrieval. If you know something, let me know. Until then, I will contact Google for some numbers.

What is the advantage of this approach?

It's super fast. There's no need to set up a vector database like Google Cloud Vector Search or pgvector with Cloud SQL or AlloyDB. However, it also comes with limitations like the ones we covered. Still, I believe 80% of use cases can be covered with this simple approach.

What else can I use?

Multiple offerings on Google Cloud provide Retrieval Augmented Generation. And depending on your use case, one might be a better fit. If you are unsure, write me on LinkedIn, and we can discuss it.

Basic

The easiest one is Vertex AI Grounding, which allows you to ground the model's answers with Google Search or your own documents.
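
For comparison, a hedged sketch of grounding Gemini with Google Search through the SDK's grounding tool (the exact API surface may differ across SDK versions):

from vertexai.generative_models import GenerativeModel, Tool, grounding

# Ground answers with Google Search instead of your own corpus.
search_tool = Tool.from_google_search_retrieval(
    grounding.GoogleSearchRetrieval()
)

model = GenerativeModel("gemini-1.5-flash-001", tools=[search_tool])
print(model.generate_content("What is Vertex AI?").text)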

Intermediate

In addition, we have the RAG API, which provides a bit more control while keeping implementation and management effort low.

Advanced

And finally, we have a custom solution that provides full flexibility but introduces much more complexity.

The full code for this article is available on GitHub.

Thanks for reading and watching

I appreciate your feedback and questions. You can find me on LinkedIn. Even better, subscribe to my YouTube channel ❤️.

Llama generated with Imagen 3


Written by Sascha Heyer

Hi, I am Sascha, Senior Machine Learning Engineer at @DoiT. Support me by becoming a Medium member 🙏 bit.ly/sascha-support
