Boosting AI’s Power: How Retrieval-Augmented Generation and LlamaIndex Are Enriching LLM Responses

Rose Ellison
evolv Consulting
Nov 20, 2023

For our LLM solutions to deliver value, we must ensure sophisticated, context-aware, and up-to-date responses. This is where LlamaIndex comes in: it gives developers a powerful toolkit for building cutting-edge AI solutions. Retrieval-Augmented Generation (RAG) is an approach that combines the strengths of two AI powerhouses: generative language models and dynamic information retrieval systems. In this article, we'll explore the world of RAG, its impact on response quality, and how to tune for the best responses using LlamaIndex, a tool that is quickly becoming an industry standard for building RAG applications.

RAG is not always necessary; when it isn't, it can add computational overhead, complexity, and latency to your model workflow.

Consider using a RAG pipeline in these scenarios:

  • Your LLM needs access to dynamically changing data.
  • Your LLM has issues with long-term memory.
  • Your LLM is prone to generating hallucinated responses.

How does RAG work?

Normally, an LLM answers questions based on the information it was trained on in the past. However, RAG allows the LLM to look up additional, current information from a large database whenever it needs to answer a question. This means it can provide answers that are not only based on what it already knows, but also on the most recent and relevant information available. It’s like having the ability to use a constantly updated reference book to make sure its answers are as accurate and up-to-date as possible.

RAG is different from transfer learning. With transfer learning, data is incorporated directly into the model during the training phase. RAG, on the other hand, retrieves relevant information from a vector database and injects it into the prompt at inference time.

More technically, RAG operates by first using a query encoder to transform the input prompt into a query vector. This vector is then used to perform a similarity search in the vector space of the knowledge base, retrieving the most relevant documents. These documents are then fed, along with the original prompt, into a sequence-to-sequence model, which synthesizes the retrieved information with its inherent language generation capabilities. The model is able to combine the depth and breadth of external data with sophisticated language understanding, leading to more informed, accurate, and contextually rich responses.
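
As a rough illustration of that flow, here is a minimal sketch in Python. The names embed(), vector_store.search(), and llm() are placeholders for your own embedding model, vector database client, and language model, not any specific library's API.

def rag_answer(prompt, vector_store, embed, llm, top_k=3):
    # 1. Encode the prompt into a query vector
    query_vector = embed(prompt)

    # 2. Retrieve the most similar documents from the knowledge base
    documents = vector_store.search(query_vector, top_k=top_k)

    # 3. Inject the retrieved context into the prompt and generate a response
    context = "\n\n".join(doc.text for doc in documents)
    augmented_prompt = f"Context:\n{context}\n\nQuestion: {prompt}\nAnswer:"
    return llm(augmented_prompt)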

More on Vector Databases

In a RAG system, vector databases serve as the storage and retrieval backbone. Their power comes from capitalizing on both structured and unstructured data: they combine vector embeddings, which are representations of unstructured data, with structured data such as metadata and indexes. By labeling vector embeddings with descriptive metadata and indexing them, a vector database provides additional layers of context. This context enriches the embeddings, making them not just points in space but meaningful representations with identifiable characteristics. This extra information can be crucial for more nuanced retrieval tasks, such as finding the most relevant piece of information among many similar items, and it allows for more precise and context-aware operations within the database.
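
To make that concrete, here is a small, library-free sketch of metadata-aware vector retrieval. Real vector databases expose similar filtered similarity queries; the record layout and function names below are illustrative only.

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Each record pairs an embedding (unstructured) with metadata (structured)
records = [
    {"embedding": np.array([0.1, 0.9]), "metadata": {"source": "faq", "year": 2023}, "text": "..."},
    {"embedding": np.array([0.8, 0.2]), "metadata": {"source": "blog", "year": 2021}, "text": "..."},
]

def search(query_embedding, metadata_filter, top_k=1):
    # Filter on structured metadata first, then rank by vector similarity
    candidates = [r for r in records
                  if all(r["metadata"].get(k) == v for k, v in metadata_filter.items())]
    candidates.sort(key=lambda r: cosine_similarity(query_embedding, r["embedding"]), reverse=True)
    return candidates[:top_k]

results = search(np.array([0.2, 0.8]), {"source": "faq"})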

Chunk Sizes and Chunking Methods

There are two chunk-related parameters we control: chunk size and chunking method. Our goal in tuning them is to retrieve chunks that provide the most relevant context while minimizing the noise introduced into the prompt. Performance should also be a key metric when adjusting these parameters.

Chunk Sizes

Chunk size refers to the portion of text data that an LLM handles in a single instance. In a RAG setup, where the LLM is augmented with external data retrieval, determining the right chunk size is a balancing act: the model needs enough context to understand and respond accurately without overwhelming its processing capacity.

There's no one-size-fits-all answer for the ideal chunk size in a RAG pipeline. It varies based on the specific application, the LLM's capabilities, and the nature of the data being processed. The goal is to tailor the chunk size so that the LLM has enough context for a comprehensive understanding while remaining within its processing limits. Chunk sizes might seem like a small detail in the grand scheme of a RAG pipeline, but their impact on the effectiveness of LLM responses is significant. By fine-tuning chunk sizes, we can greatly enhance the ability of LLMs to produce accurate, relevant, and coherent responses, fully leveraging the power of RAG technology.
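
One practical way to find a good chunk size is simply to try several and compare the answers. The sketch below assumes the LlamaIndex API as used later in this article (late 2023), along with a "data" directory and an example question; judging response quality is still up to you or an evaluation tool.

from llama_index import ServiceContext, SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()

# Build and query an index at several chunk sizes, then compare the answers
for chunk_size in [256, 512, 1024]:
    service_context = ServiceContext.from_defaults(chunk_size=chunk_size)
    index = VectorStoreIndex.from_documents(documents, service_context=service_context)
    response = index.as_query_engine().query("What did the author do growing up?")
    print(f"--- chunk_size={chunk_size} ---\n{response}\n")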

Chunking Methods

Consider a RAG pipeline using an LLM like GPT-3, which can handle 2,048 tokens of context at a time. Now, let's take a 4,000-token news article for summarization, exceeding the LLM's limit.

  • Standard Approach: If we split the article into two 2,000-token chunks, each chunk is processed independently, risking context loss. The LLM might miss the continuity, leading to a disjointed summary.
def split_text_standard(text, chunk_size):
    # Treat each whitespace-separated word as roughly one token
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

# Assuming each token is roughly a word for simplicity
token_limit = 2048
long_text = "..."  # Your long text of 4000 tokens

# Split the text into chunks
chunks = split_text_standard(long_text, token_limit)

# Process each chunk (llm() is a placeholder for your model call)
for chunk in chunks:
    response = llm(chunk)
    print(response)
  • Overlapping Chunking: By breaking the article into overlapping 1,000-token chunks (e.g., chunk 1 has tokens 1–1,000, chunk 2 has tokens 501–1,500), we maintain context across chunks. This overlap ensures that when the LLM integrates RAG’s retrieved data, it has a complete understanding of the article, resulting in a more cohesive and accurate summary.
def split_text_overlapping(text, chunk_size, overlap):
    # Step forward by (chunk_size - overlap) so consecutive chunks share `overlap` tokens
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size - overlap)]

token_limit = 1000   # Chunk size in approximate tokens, matching the 1,000-token example above
overlap_size = 500   # Overlapping tokens
long_text = "..."    # Your long text of 4000 tokens

# Split the text into overlapping chunks
chunks = split_text_overlapping(long_text, token_limit, overlap_size)

# Process each chunk
for chunk in chunks:
    response = llm(chunk)
    print(response)
  • Question & Answer-Based Chunking: In question & answer-based chunking, chunks are tailored to anticipated queries or specific user interests. This method segments text into parts that are likely to be relevant to frequent questions. This approach ensures that the RAG system retrieves chunks that are highly relevant to user questions, enhancing the precision of responses.
def find_relevant_section(text, query):
    # Naive keyword match: keep the paragraphs that mention the query terms.
    # A real system would use embeddings or a retriever here.
    paragraphs = text.split("\n\n")
    return "\n\n".join(p for p in paragraphs if query.lower() in p.lower())

def split_text_query_based(text, queries):
    # Build one chunk per anticipated query
    chunks = []
    for query in queries:
        relevant_section = find_relevant_section(text, query)
        chunks.append(relevant_section)
    return chunks

# Example usage
queries = ['climate change', 'economic policies']  # Example queries
query_based_chunks = split_text_query_based(long_text, queries)

for chunk in query_based_chunks:
    response = llm(chunk)
    print(response)

When selecting a chunk size and chunking method, keep in mind the trade-off between providing relevant context and maintaining efficient performance.

LlamaIndex

LlamaIndex is an advanced orchestration framework designed to amplify the capabilities of LLMs. Sounds a lot like LangChain? Think of LangChain as a broader, general-purpose framework, while LlamaIndex is a specialized framework optimized for indexing and retrieving data. Although the tools are similar, you don't have to choose one or the other; you can use both.

The LlamaIndex API lets you ingest and query your data in five lines of code.

from llama_index import VectorStoreIndex, SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

query_engine = index.as_query_engine()
response = query_engine.query("What did the author do growing up?")

This code snippet demonstrates how documents are read from a directory, indexed into a vector store for semantic searching, and then queried to find relevant information using the LlamaIndex API.

Prototyping a RAG application is easy, but making it performant, robust, and scalable to a large knowledge corpus is hard. So we shouldn’t stop at these 5 lines of code.

Node Parser

Node Parsers play a pivotal role in optimizing RAG applications by ensuring that the text chunks fed into the system are of an appropriate size and contextually relevant. Their ability to customize chunk sizes, overlaps, and metadata inheritance makes them an indispensable tool in the realm of efficient information retrieval and processing in RAG systems.

A Node Parser takes a list of documents and breaks them down into Nodes, each of a specific size. When a document is segmented into Nodes, all its attributes, including metadata and templates, are inherited by the child Nodes. This process ensures that each Node retains the essential characteristics of its parent document. The chunking typically uses a TokenTextSplitter, with default settings of 1024 tokens per chunk and an overlap of 20 tokens between chunks.

from llama_index.node_parser import SimpleNodeParser


node_parser = SimpleNodeParser.from_defaults(chunk_size=1024, chunk_overlap=20)
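
For example, assuming documents were loaded as in the earlier snippet, the parser turns them into Nodes that inherit the parent document's metadata:

# `documents` comes from SimpleDirectoryReader("data").load_data() above
nodes = node_parser.get_nodes_from_documents(documents)
print(len(nodes))
print(nodes[0].metadata)  # inherited from the parent document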

Additionally, Node Parsers offer several customization features:

  • text_splitter: Choose the method for splitting text into chunks. The default is TokenTextSplitter.
    Customizing the text_splitter allows for more tailored chunking, especially for non-English languages or specific text formats like code.

SentenceSplitter Configuration:

import tiktoken
from llama_index.text_splitter import SentenceSplitter

text_splitter = SentenceSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    paragraph_separator="\n\n\n",
    secondary_chunking_regex="[^,.;。]+[,.;。]?",
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)

TokenTextSplitter Configuration:

import tiktoken
from llama_index.text_splitter import TokenTextSplitter

text_splitter = TokenTextSplitter(
    separator=" ",
    chunk_size=1024,
    chunk_overlap=20,
    backup_separators=["\n"],
    tokenizer=tiktoken.encoding_for_model("gpt-3.5-turbo").encode,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)

CodeSplitter Configuration:

from llama_index.text_splitter import CodeSplitter

text_splitter = CodeSplitter(
    language="python",
    chunk_lines=40,
    chunk_lines_overlap=15,
    max_chars=1500,
)

node_parser = SimpleNodeParser.from_defaults(text_splitter=text_splitter)
  • include_metadata: Decide whether Nodes should inherit document metadata.
  • include_prev_next_rel: Option to include relationships between chunked Nodes.
  • metadata_extractor: Further processing to extract valuable metadata.

SentenceWindowNodeParser

The SentenceWindowNodeParser is a specialized parser that segments documents into individual sentences, with each Node encompassing a "window" of surrounding sentences. This parser is particularly useful for generating context-specific embeddings.

import nltk
from llama_index.node_parser import SentenceWindowNodeParser

node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_sentence",
)
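
As a quick sketch (again assuming documents from the earlier snippet), each resulting Node stores a single sentence as its text, with the surrounding window available in its metadata under the configured key:

nodes = node_parser.get_nodes_from_documents(documents)
print(nodes[0].text)                # the individual sentence
print(nodes[0].metadata["window"])  # the surrounding sentences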

LlamaIndex also offers evaluation tools to help you measure and improve the quality of your LLM's responses.

Conclusion

LlamaIndex emerges as a pivotal tool in this landscape, offering streamlined processes for ingesting and querying data, alongside the invaluable Node Parser for efficient text chunking. With the right application and understanding of these technologies, we can push the boundaries of AI’s capabilities, ensuring more accurate, coherent, and contextually appropriate responses in a variety of applications.

###

To learn more about evolv Consulting, visit our website.
