RAG: A Beginner’s Guide to Understanding the Basics

Safouane Ennasser
10 min read · Aug 10, 2024


This article will help newcomers learn about RAG. We will use simple tools, Groq (free tier), LlamaIndex, and Huggingface embeddings, as examples to explain how it works.

Huggingface, LlamaIndex and Groq to understand RAG

Summary:

Introduction

1- RETRIEVAL
. Keyword-Based Retrieval
. Semantic-Based Retrieval Using Embeddings
. From Keyword-Based to Semantic-Based Retrieval
1.1 LlamaIndex in Action
. Why LlamaIndex?
. Prepare your data
. Using Huggingface Embeddings with LlamaIndex
. Query Time: Embedding the Query Text
1.2 LlamaIndex Retriever

2- Generation
2.1 Manual Generation from Retrieved Data
. Constructing the Prompt
. LLM Solutions: Open-Source and Proprietary
. Groq: Combining the Best of Both Worlds (FREE TIER)
. Generation: final step
2.2 Automatic Generation Using a Query Engine

Conclusion

Introduction

Large language models (LLMs) have been revolutionary in providing intelligible, human-like responses to a wide range of queries. However, they come with limitations, especially when dealing with private data.
But another significant hurdle is the “context window” — the amount of information an LLM can process at a time. This constraint becomes prominent when we deal with large documents or datasets.

Here’s where RAG, or Retrieval-Augmented Generation, steps in. RAG combines LLM capabilities with information retrieval techniques. The essence of RAG is simple: instead of cramming all the relevant data into a limited context window, we retrieve the most pertinent chunks of information as needed.

For instance, you can’t expect an LLM to read and understand an entire book in one go. So we break the data into smaller, digestible chunks, then use the most relevant chunks as context for the LLM (the context window can handle this, since it now contains only a few chunks of the book rather than the whole text).

Before asking the LLM a question about that book or document, we first have to fetch the parts most likely to answer the question, small enough to fit into the LLM’s context window, and then pass them to the LLM as context along with the question.
The core of retrieval is getting the right chunks of data that can answer a question.
Note that up to this step, we haven’t interacted with any LLM.

1- RETRIEVAL:

The retrieval phase is crucial in RAG as it ensures that the LLM accesses the most relevant pieces of information from the vast pool of data chunks. This process involves advanced techniques for finding data chunks that fit not just the literal keywords but also the semantics or meaning behind your query.

- Keyword-Based Retrieval

How It Works:
Imagine you have chunks of text about different topics. When you search for “climate change effects,” keyword-based retrieval will scan through all the chunks and return those that contain the exact words “climate,” “change,” and “effects.”

Limitations:

  • Synonym Issues: Suppose there’s a chunk that extensively discusses “global warming” and its impacts, like rising sea levels and changing weather patterns, but doesn’t use the exact phrase “climate change effects.” A keyword-based search would miss this chunk.
  • Context Matters: Another chunk might include the word “effects,” but in a completely unrelated context, such as “the effects of exercise,” making it irrelevant to your query.
  • Spam Chunks: Texts with repeated keywords might be flagged as relevant, even if they provide no useful information.
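
To make this concrete, here is a minimal sketch of keyword-based matching in plain Python (the chunk texts are invented for illustration). Note how the chunk about global warming is missed because the exact query words never appear in it:

# Minimal keyword-based retrieval: a chunk matches only if it contains
# every word of the query (toy example with invented chunks)
chunks = [
    "Global warming is raising sea levels and shifting weather patterns.",
    "Climate change effects include droughts, floods and heat waves.",
    "The effects of exercise on sleep quality are well documented.",
]

query_words = {"climate", "change", "effects"}

for chunk in chunks:
    chunk_words = set(chunk.lower().replace(",", "").replace(".", "").split())
    print("MATCH " if query_words.issubset(chunk_words) else "MISSED", "|", chunk)

# Only the second chunk matches, even though the first one is clearly relevant,
# and a looser "any keyword" rule would also pull in the off-topic third chunk.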

- Semantic-Based Retrieval Using Embeddings

- Embeddings Explained:

Embeddings are numerical representations of text that capture the semantic meaning of words, phrases, or entire documents. In simple terms, they convert text into vectors — strings of numbers in a multi-dimensional space. Similar texts will have vectors that are closer together in this space.

Imagine you have a big bag of words. Now, if you want to understand what these words mean or how they relate to each other, you might try to draw a map where similar words are close together.
That’s what embeddings do. They turn words (or other types of data like images or sentences) into numbers so that computers can understand them.

- How Semantic Retrieval Works:

When you search for “climate change effects,” the system doesn’t just look for exact keywords. Instead, it translates your query into a vector and searches for chunks with vectors that are close to it in the embedding space.

Example:

  • Query: “climate change effects”
  • Relevant Chunks: Using embeddings, the system can identify a chunk discussing “global warming” impacts on weather patterns and sea levels as highly relevant, even though it doesn’t use the exact words “climate change effects.”
  • Non-Relevant Chunks: Conversely, a chunk that discusses “the effects of exercise” will not be considered relevant because its vector representation will be far from the query vector in the embedding space.
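
As a rough illustration, relevance in the embedding space is usually measured with cosine similarity. The vectors below are made up (real embeddings have hundreds of dimensions), but they show the idea:

import math

# Toy 3-dimensional "embeddings" (invented numbers, for illustration only)
query_vec          = [0.9, 0.1, 0.0]   # "climate change effects"
global_warming_vec = [0.8, 0.2, 0.1]   # chunk about global warming impacts
exercise_vec       = [0.1, 0.0, 0.9]   # chunk about the effects of exercise

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(query_vec, global_warming_vec))  # ~0.98 -> relevant
print(cosine_similarity(query_vec, exercise_vec))        # ~0.11 -> not relevant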

- From Keyword-Based to Semantic-Based Retrieval

Transitioning from keyword-based lookup to semantic-based retrieval involves a significant shift in how we think about search. Instead of merely looking for occurrences of specific words, we focus on understanding the meaning behind the search queries and chunks of data.

1.1 LlamaIndex in Action

LlamaIndex is a tool designed to simplify the process of integrating LLMs with retrieval systems like RAG. It offers several advantages over its competitors.

- Why LlamaIndex?

  • Ease of Use: LlamaIndex provides user-friendly interfaces for creating and managing indexes of documents and their embeddings.
  • Scalability: It supports large datasets, making it suitable for applications with vast amounts of text.
  • Flexibility: It allows customization based on specific use cases and requirements.

Install LlamaIndex:

#using pip
pip install llama-index

#or using poetry
poetry add llama-index

This will install all needed packages for loading and chunking data.

- Prepare your data

First, put your source text files in the same directory:

from llama_index.core import SimpleDirectoryReader

reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()
#At this step,
#all your documents are loaded from the directory "path/to/directory"

At this point, the variable “documents” contains all our text files loaded with LlamaIndex.
The next step is to create an index over the loaded data, which will contain the embeddings for each node in the loaded documents.
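
As a quick sanity check, you can inspect what was loaded (the output will depend on your own files):

# Optional: inspect the loaded documents
print(f"Loaded {len(documents)} documents")
print(documents[0].text[:200])   # first 200 characters of the first document
print(documents[0].metadata)     # source metadata, e.g. file name and path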

- Using Huggingface Embeddings with LlamaIndex

Huggingface provides state-of-the-art models for generating text embeddings, which you can use to build your semantic index. Below is an example of how to create a vector-based index in LlamaIndex using embeddings from Huggingface.

First, install the LlamaIndex Huggingface integration library:

pip install llama-index-embeddings-huggingface
#OR
poetry add llama-index-embeddings-huggingface

Before we integrate the embeddings into our project, let’s test the model to get familiar with it.

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

#Please check Huggingface to explore all available models
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Try to embed any sentence and inspect the result
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))  # bge-small-en-v1.5 produces 384-dimensional vectors
print(embeddings[:5])   # first five values of the vector

Now the goal is to generate embeddings for all the documents loaded previously with LlamaIndex, then create an index over them.
This can be done in a single line of code:

from llama_index.core import VectorStoreIndex

index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)

This launches the embedding calculation for each loaded document and stores the whole structure in a vector store.
Our index is now ready to receive queries and return the most relevant chunks (LlamaIndex document nodes).

- Query Time: Embedding the Query Text

At query time, the embedding model plays a critical role once again. To find the most relevant documents or data chunks, you will need to convert the query text into an embedding. This embedding captures the semantic meaning of the query, enabling the system to search for similar embeddings in your index.

Here’s an overview of the steps involved at query time (a small manual sketch follows this list):

  • 1 Embed the Query Text: Use the same embedding model to convert the query text into an embedding vector.
  • 2 Search the Index: Compare the query embedding with the embeddings stored in the vector index to find the most semantically similar documents.
  • 3 Retrieve and Rank: Fetch the top-ranked documents and return them as the results.
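
Before using the built-in retriever, here is a hand-made sketch of these three steps, reusing the embed_model from earlier with a couple of invented chunk texts (the real index does the same thing over every stored node, with an optimized vector search):

import numpy as np

# 1. Embed the query text with the same embedding model used for the index
query_embedding = embed_model.get_text_embedding("climate change effects")

# 2. Embed a few chunks (in the real index these embeddings are precomputed)
chunk_texts = [
    "Global warming is raising sea levels and shifting weather patterns.",
    "The effects of exercise on sleep quality are well documented.",
]
chunk_embeddings = [embed_model.get_text_embedding(t) for t in chunk_texts]

# 3. Rank the chunks by cosine similarity to the query embedding
def cosine(a, b):
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

scored = [(cosine(query_embedding, emb), text) for emb, text in zip(chunk_embeddings, chunk_texts)]
for score, text in sorted(scored, reverse=True):
    print(f"{score:.3f}  {text[:50]}")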

1.2 LlamaIndex Retriever

LlamaIndex simplifies this process with its retriever, which abstracts away the complexity and makes it easier to conduct efficient semantic searches.

Let’s ask our vector index for the chunks most similar to a given query text.

#Note the argument similarity_top_k=5.
#It indicates that we need the top 5 chunks similar to our query text.
retriever = index.as_retriever(similarity_top_k=5)

# After instantiating the retriever from the index, use the retrieve() method
# to get similar chunks.
nodes = retriever.retrieve("YOUR_QUERY_TEXT_HERE")

#For each retrieved node, show the score
print(f"Found {len(nodes)} Nodes")
for n in nodes:
    print(f"SCORE: {n.get_score()} | ID: {n.node_id}... | TEXT {n.text.lstrip()[:20]}")

#Result should be something like

# Found 5 Nodes
# SCORE: 0.4033362444287719 | ID: ba90e6c4-b397-48d6-8e20-4943a563b7fd... | TEXT How to avoid nightma
# SCORE: 0.2708382362746425 | ID: d40b4b4d-257c-47b3-850d-b9e365ba6605... | TEXT Along with the advic
# SCORE: 0.18155664141093883 | ID: f85702c5-dabe-4028-8aa1-6af482b09d63... | TEXT Elizabeth said anxie
# SCORE: 0.13770697617637015 | ID: 51e7f694-6b7a-4eb6-895e-fcbe99a1c967... | TEXT Getting tired from t
# SCORE: 0.12482057603527985 | ID: 4a872883-5083-4137-9448-e49e6d675baa... | TEXT Allowing the user to

As you can see, we use the embedding calculation in two separate parts of the RAG pipeline:

  • Index building (to embed all loaded documents)
  • Query time (to embed the query before searching the index)

2- Generation

After successfully retrieving the most relevant documents or data chunks using semantic search, the next phase is Generation. This step leverages the power of Large Language Models (LLMs) to generate meaningful and contextually relevant responses based on the retrieved information.

2.1 Manual Generation from Retrieved Data

To give a clear understanding of the entire RAG process, let’s manually generate a response using the retrieved data before leveraging the fully automated query engine in LlamaIndex. This will help you comprehend how the retrieved data is used to form a meaningful response.

- Constructing the Prompt

The final task in Retrieval-Augmented Generation (RAG) involves constructing a prompt that combines the query and the relevant data chunks to generate a coherent response. This prompt is fed into the Large Language Model (LLM) to get an answer that leverages both the retrieved information and the generative capabilities of the model.

# Prepare the prompt by concatenating the question and the relevant
# text chunks as context.
question = "YOUR_QUERY_HERE"
context = "\n".join([f"- {node.get_text()}" for node in nodes])
prompt_template = f"Question: {question}\n\nRelevant Information:\n{context}\n\nAnswer based on the above information"

After building the prompt, let’s ask the LLM to answer our question.

- LLM Solutions: Open-Source and Proprietary

There are various LLM solutions, each with its own advantages:

  • Open-Source Models:
    Examples: GPT-Neo, GPT-J, BLOOM, and models from Huggingface.
    Benefits: Free to use, customizable, and transparent.
  • Proprietary Models:
    Examples: OpenAI’s GPT-3/4, Microsoft’s AI solutions, and IBM Watson.
    Benefits: Offer cutting-edge performance, reliable support, and are often easier to integrate for enterprise solutions.

- Groq: Combining the Best of Both Worlds (FREE TIER)

In this section, we’ll walk through the final steps of using a Groq LLM to generate a coherent response based on the prepared prompt.

Groq offers a unique proposition with its free tier. It allows you to use powerful LLMs like the latest open-source models without the hefty price tag. Their free tier is particularly useful for developers and small businesses who want to leverage state-of-the-art LLMs without incurring significant costs.

To use Groq’s free tier, follow these steps:

  • Create an Account on Groq Website (It’s Free!!)
  • Visit the Groq website and sign up for a free account.
  • Navigate to the API section and create an API key. Remember to copy this key as it is only displayed once (a quick note on storing it safely follows below).
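
A general good practice (not something specific to Groq or LlamaIndex) is to keep the key out of your source code, for example by exporting it as an environment variable and reading it in Python before passing it to the client:

import os

# Read the key from an environment variable instead of hard-coding it.
# Set it beforehand in your shell, e.g.: export GROQ_API_KEY="your_key_here"
groq_api_key = os.environ["GROQ_API_KEY"]

You can then pass groq_api_key wherever api_key="YOUR_API_KEY" appears in the snippets below.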

Install the LlamaIndex Groq integration:

pip install llama-index-llms-groq
#OR
poetry add llama-index-llms-groq

Now let’s instantiate a Groq LLM to be used as the base LLM for generating the final answer.

- Generation: final step

from llama_index.llms.groq import Groq

llm = Groq(model="llama3-70b-8192", api_key="YOUR_API_KEY")

#Final step

response = llm.complete(prompt_template)
print(response)

Congratulations! You’ve just created your first RAG. It’s pretty basic and manually built, but you’ve taken the first step.

2.2 Automatic Generation Using a Query Engine

Quoting the LlamaIndex official documentation:

Query engine is a generic interface that allows you to ask questions over your data.
A query engine takes in a natural language query, and returns a rich response. It is most often (but not always) built on one or many indexes via retrievers. You can compose multiple query engines to achieve more advanced capability.

And this is exactly what we need: a tool that automates all the steps we did manually before.

# First we create the query engine with embed_model and llm as parameters

query_engine = index.as_query_engine(embed_model=embed_model, llm=llm)

response = query_engine.query(question)
print(response)

These two lines of code perform both the retrieval and the generation in one shot.
Under the hood, query_engine.query(question) does the following:

  • 1/ creates an embedding for the query (the question)
  • 2/ retrieves the relevant chunks
  • 3/ wraps the chunks and the question into a prompt template
  • 4/ asks the LLM using that prompt

(we already did these steps manually)

Here is the final code.
Try it with your own data; just put your text files in “path/to/directory”.

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core import Settings
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Instantiate LLM and embedding providers
huggingface_embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5"
)
groq_llama3_llm = Groq(model="llama3-70b-8192", api_key="YOUR_API_KEY")

# Use LlamaIndex Settings to store the global config for the embedding model and LLM
Settings.llm = groq_llama3_llm
Settings.embed_model = huggingface_embed_model


# Load the data into llamaindex documents
reader = SimpleDirectoryReader(input_dir="path/to/directory")
documents = reader.load_data()

# Build the index from the documents (the embed_model is picked up from Settings)
index = VectorStoreIndex.from_documents(documents)

# Get the query engine from the index
# ! Note that we didn't pass llm and embed_model as arguments
# ! they will be loaded automatically from Settings
query_engine = index.as_query_engine()

# Get the response
question = "YOUR_QUERY_HERE"
response = query_engine.query(question)
print(response)

Conclusion

This article introduced you to the basics of Retrieval-Augmented Generation (RAG), focusing on how to overcome the limitations of context windows in large language models. We covered:

  • The Retrieval Phase: How to split and store large datasets, and retrieve relevant chunks using both keyword-based and semantic-based searches.
  • The Generation Phase: How to construct prompts and use powerful LLMs like Groq models to generate meaningful answers.

Through this step-by-step guide, we’ve shown how to integrate various technologies, including Huggingface embeddings, LlamaIndex, and Groq’s free-tier models, to build a simple RAG system.

What’s Next?

This article was just an introduction to RAG. Stay tuned for our upcoming articles where we will delve deeper into more complex topics like:

  • Advanced Pipelines: How to set up more sophisticated data loading pipelines
  • Storage Databases: Exploring efficient storage
  • Graph-based RAG
