Building and Evaluating Basic and Advanced RAG Applications with LlamaIndex and Gemini-pro in Google Cloud — Part 1

Ishmeet Mehta
Google Cloud - Community
6 min read · Mar 5, 2024

Retrieval-Augmented Generation (RAG) is an effective method for answering questions about your own data. In this two-part series, we will build and evaluate both basic and advanced retrieval techniques.

To build a high-quality RAG system that can be used in production, we need to consider the following factors:

  1. Effective retrieval techniques: evaluate retrieval and generation strategies that provide the LLM with highly relevant context it can use to generate its answers.
  2. A comprehensive evaluation framework: to help you efficiently iterate on and improve your RAG system, both during initial development and after deployment.

A recent RAG survey paper (“Retrieval-Augmented Generation for Large Language Models: A Survey”, Gao, Yunfan, et al., 2023) lists some of the key advanced methods we can use to improve the retrieval quality of our RAG pipelines by introducing pre- and post-retrieval strategies.

In this tutorial, we will create basic and advanced RAG pipelines using LlamaIndex with the Gemini-pro model. We will also use the TruLens RAG triad of metrics to evaluate the performance of our application by computing context relevance, answer relevance, and groundedness.

Let’s get started.

Pre-requisites and Set up

I used a Google Colab notebook to build this tutorial.

Also, make sure you have the necessary permissions to upload and download the files from Google Cloud Storage that are required to run this notebook.

  1. Install the necessary packages in your notebook.

Note: To ensure compatibility between packages, we pin specific versions. We use LlamaIndex as the data framework, TruLens as the evaluation framework, and LiteLLM to simplify LLM completion and embedding calls in conjunction with TruLens.

!pip install pypdf cohere llama-index==0.9.48 google-generativeai trulens_eval==0.22.1 litellm==1.23.10 torch sentence-transformers

2. Set the environment variables for the API keys.

You will need to get an API key from Google AI Studio. Once you have one, you can either pass it explicitly to the model or use the GOOGLE_API_KEY environment variable.

%env GOOGLE_API_KEY=...
%env GEMINI_API_KEY=..

3. Import keys

import os

GOOGLE_API_KEY = "" # add your GOOGLE API key here
GEMINI_API_KEY = "" # add your GEMINI API key here
os.environ["GOOGLE_API_KEY"] = GOOGLE_API_KEY
os.environ["GEMINI_API_KEY"] = GEMINI_API_KEY

4. Use the Gemini().complete method to ask Gemini to generate text for you.

from llama_index.llms import Gemini
resp = Gemini().complete("Write a poem about a magic backpack")
print(resp)
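If you prefer to stream the output token by token rather than wait for the full completion, the LlamaIndex LLM interface also exposes a stream_complete method, which the Gemini integration implements in llama-index 0.9.x. This is an optional variation on the call above:

# Optional: stream the completion as it is generated instead of waiting for the full text
llm = Gemini()
for chunk in llm.stream_complete("Write a poem about a magic backpack"):
    print(chunk.delta, end="")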

Now that we have established connectivity to Gemini LLM, let's start working on the RAG pipeline.

5. Authenticate to Google Cloud Storage, where the artifacts and other documents required to run this lab are stored.

from google.colab import auth

auth.authenticate_user()

6. Import the utils file as a module. This module contains pre-built feedback functions and other utilities to help you run evaluations later in this section.

!gsutil cp gs://machine-learning-gemini/utils.py .
import importlib
importlib.import_module('utils')

What is Basic RAG?

In RAG, your data is loaded and prepared for queries or “indexed”. User queries act on the index, which filters your data down to the most relevant context. This context and your query then go to the LLM along with a prompt, and the LLM provides a response using the context.

Basic RAG pipeline

Steps for Basic RAG pipeline

Before we start, ensure that the PDF and utility files mentioned above are uploaded to a Google Cloud Storage bucket in a project you have access to.

  1. Now, download the PDF file to your Colab notebook using the gsutil command-line utility. We use this PDF (an unstructured document) as context for the Gemini-pro model that we connected to earlier.

Note: You can use your own PDF and eval questions (relevant to the provided context) for this exercise as well.

!gsutil cp gs://machine-learning-gemini/eBook-How-to-Build-a-Career-in-AI.pdf .

2. In this step, we ingest the PDF, chunk it into smaller pieces, create embeddings using our embedding model, and index them using VectorStoreIndex.

from llama_index import SimpleDirectoryReader

# Load the PDF; each page becomes a separate Document object
documents = SimpleDirectoryReader(
    input_files=["eBook-How-to-Build-a-Career-in-AI.pdf"]
).load_data()

print(type(documents), "\n")
print(len(documents), "\n")
print(type(documents[0]))
print(documents[0])
print(documents[1])

3. Next, we merge all 41 pages of the PDF into a single document to improve text-splitting accuracy when using more advanced retrieval methods.

from llama_index import Document

document = Document(text="\n\n".join([doc.text for doc in documents]))

4. We build the index using VectorStoreIndex in LlamaIndex. First, we define the LLM and the embedding model.

Note: We are using a Hugging Face embedding model (BAAI/bge-small-en-v1.5).

from llama_index import VectorStoreIndex
from llama_index import ServiceContext
from llama_index.llms import Gemini

llm = Gemini(model="models/gemini-pro", temperature=0.1)

# Create a service context using the Gemini LLM and a local embedding model
service_context = ServiceContext.from_defaults(
    llm=llm, embed_model="local:BAAI/bge-small-en-v1.5"
)

# Build the vector index over the merged document
index = VectorStoreIndex.from_documents(
    [document], service_context=service_context
)
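Step 2 mentioned chunking, but the code above relies on LlamaIndex's default chunking. If you want to control how the merged document is split before embedding, ServiceContext.from_defaults also accepts chunk_size and chunk_overlap parameters; the values below are illustrative rather than tuned:

# Optional: explicit chunking parameters (illustrative values, not tuned)
chunked_service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    chunk_size=512,
    chunk_overlap=50,
)
chunked_index = VectorStoreIndex.from_documents(
    [document], service_context=chunked_service_context
)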

5. We obtain a query engine from this index, which allows us to send user queries that perform retrieval and synthesis against our loaded PDF.

query_engine = index.as_query_engine()

Let’s try out our first request using this query engine.

response = query_engine.query(
    "What are steps to take when finding projects to build your experience?"
)
print(str(response))
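It can also be useful to check which chunks were retrieved and passed to Gemini for this answer. The response object exposes the retrieved source nodes, so a quick sketch (attribute names as in LlamaIndex 0.9.x) looks like this:

# Inspect the retrieved chunks and their similarity scores
for source in response.source_nodes:
    print(f"score: {source.score:.3f}")
    print(source.node.get_content()[:200], "\n")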

Now that we have set up a basic RAG pipeline, the next step is to run evaluations against it using the RAG triad. This will help us see how well basic RAG performs in comparison with the advanced retrieval techniques discussed in part 2.

Here, we use TruLens to initialize the feedback functions that evaluate our app. The evaluation consists of a parallelized comparison between the query, the response, and the retrieved context.

For this exercise, I have created three feedback functions (context relevance, answer relevance, and groundedness) as utilities in the utils file.

RAG triad of metrics
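The utils file is treated as a black box in this tutorial. For readers who want a sense of what such prebuilt feedback functions look like, the following is a minimal sketch of the RAG triad using the TruLens and LiteLLM APIs. It is an assumption about the shape of utils.py rather than its actual contents, and provider imports or method names may differ slightly between TruLens versions:

# Sketch only: an approximation of the RAG triad feedback functions in utils.py
import numpy as np
from trulens_eval import Feedback, TruLlama
from trulens_eval import LiteLLM  # provider import may differ by TruLens version
from trulens_eval.feedback import Groundedness

# Gemini is reached through LiteLLM; "gemini/gemini-pro" is LiteLLM's model string
provider = LiteLLM(model_engine="gemini/gemini-pro")

# Select the text of the retrieved source nodes as the "context" for evaluation
context_selection = TruLlama.select_source_nodes().node.text

# Answer relevance: does the response address the user's question?
f_qa_relevance = Feedback(
    provider.relevance, name="Answer Relevance"
).on_input_output()

# Context relevance: are the retrieved chunks relevant to the question?
f_qs_relevance = (
    Feedback(provider.qs_relevance, name="Context Relevance")
    .on_input()
    .on(context_selection)
    .aggregate(np.mean)
)

# Groundedness: is the answer supported by the retrieved context?
grounded = Groundedness(groundedness_provider=provider)
f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
    .on(context_selection)
    .on_output()
    .aggregate(grounded.grounded_statements_aggregator)
)

# A recorder like the one returned by get_prebuilt_trulens_recorder could then be:
# TruLlama(query_engine, app_id="Direct Query Engine",
#          feedbacks=[f_qa_relevance, f_qs_relevance, f_groundedness])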

6. Now we need a set of evaluation questions on which we can test our application.

Note: We have a set of evaluation questions in the utils folder to evaluate the responses from the model.

The next step is to download this file and add additional questions if needed.

!gsutil cp gs://machine-learning-gemini/eval_questions.txt .
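Once downloaded, read the questions into a Python list; eval_questions is the variable the evaluation loop in step 9 iterates over. Adding your own question is optional, and the example appended below is purely illustrative:

# Read the downloaded evaluation questions into a list for the loop in step 9
eval_questions = []
with open("eval_questions.txt", "r") as f:
    for line in f:
        item = line.strip()
        if item:
            eval_questions.append(item)

# Optional: append your own question relevant to the eBook (illustrative example)
eval_questions.append("How can altruism help you build a career in AI?")
print(eval_questions)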

7. Next, we initialize the TruLens module and reset the TruLens database that stores the evaluation results.

from IPython.display import JSON
from trulens_eval import Tru, Feedback

tru = Tru()

tru.reset_database()

LLMs are becoming a standard mechanism for evaluating generative AI applications at scale, rather than relying solely on expensive human evaluations or fixed benchmarks. This technique also lets us evaluate our applications in the custom domains we operate in and adapt to the changing data demands of our applications.

8. Here we use the custom TruLens recorder, pre-loaded with the RAG triad feedback functions (context relevance, answer relevance, and groundedness), to record our evaluations. We also specify an app_id so that this version of the application is tracked for later comparison.

from utils import get_prebuilt_trulens_recorder

# Wrap the query engine with the prebuilt RAG triad feedback functions
tru_recorder = get_prebuilt_trulens_recorder(
    query_engine, app_id="Direct Query Engine"
)

9. Now we run the query engine over each evaluation question within the TruLens recording context we set up earlier.

with tru_recorder as recording:
    for question in eval_questions:
        response = query_engine.query(question)

records, feedback = tru.get_records_and_feedback(app_ids=[])
records.head()

Note: Please view the notebook for the complete result set.

What is happening here?

We send each query in the eval set to our query engine, and in the background the TruLens recorder evaluates each query and its response against the three metrics.

In the records output, we can see each query with its response and how well it performed against these feedback functions.
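If you want a quick aggregate view in the notebook itself, in addition to the per-record output above, TruLens can also summarize the mean feedback scores per app. This call is a small optional addition, not part of the original walkthrough:

# Aggregate mean feedback scores (and latency/cost where available) per app_id
tru.get_leaderboard(app_ids=[])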

10. Run the TruLens dashboard (built on Streamlit) to view all the results.

tru.run_dashboard()

The dashboard would look something like this:

We will continue exploring advanced techniques in part 2 in detail.

Ishmeet Mehta
Google Cloud - Community

Enterprise Cloud Architect Google Cloud - Machine Learning and Generative AI Developer