Offline RAG with LlamaIndex and tiny/small LLMs

A walk-through to build a simple RAG system using LlamaIndex and TinyLlama1.1B or Zephyr-7B-Gemma-v0.1. Insights and potential improvements.

Jeremy K
The Pythoneers
8 min read · Mar 13, 2024


Image by author

In Exploring offline RAG with Langchain, Zephyr-7b-beta and DeciLM-7b, we delved into the intricacies of an offline Retrieval Augmented Generation (RAG) system, putting the emphasis on how to run it without transmitting data over the Internet, especially when dealing with personal or sensitive information. However, the landscape of open-source solutions for engaging with Large Language Models (LLMs) is vast. After LangChain, we expand our horizon by introducing another formidable contender in the arena: LlamaIndex.

Because implementing RAG poses considerable challenges, it is worth investigating how LLMs of different sizes perform, especially tiny and small ones. To this end, we have selected TinyLlama1.1B and Zephyr-7B-Gemma-v0.1 for examination. The former underwent a rigorous 90-day training regimen on 3 trillion tokens, while the latter, a fine-tuned version of Google Gemma-7B, stands as one of the premier 7B models available.

Interested in LLMs, RAG, or LlamaIndex? Let’s dive in together.

Understanding RAG — a brief recap

Let’s revisit the essence of Retrieval Augmented Generation, a concept we previously explored in the context of LangChain.

Large Language Models, while powerful, come with inherent limitations. They lack real-time information, possess no knowledge of your private data, and may exhibit a tendency to “hallucinate” when faced with unfamiliar queries.

RAG emerges as a solution poised to overcome these limitations by supplementing the LLM with contextual pieces, enriching the response to queries.

The RAG process involves several key steps:

  1. Document chunking: breaking down large documents into manageable chunks.
  2. Embedding extraction: capturing the essential features of each document chunk.
  3. Vector database storage: storing these embeddings in a vector database for efficient retrieval.
  4. Relevant context retrieval: retrieving context pieces relevant to the user’s query from the stored database.
  5. Querying the LLM: presenting the refined context and user query to the LLM for an accurate response.

For a visual representation of the RAG process, refer to the following graph:

RAG process

Now, the pivotal question arises: why opt for LangChain or LlamaIndex in this process? Let’s try to answer this question.

LlamaIndex or LangChain?

In straightforward terms, LlamaIndex plays a crucial role in retrieving context pieces and interacting with the LLM. It shares similarities with LangChain in the overall process, albeit with distinctive terminology: LlamaIndex refers to “chunks” as “nodes”.

But what sets LlamaIndex apart from LangChain, and how can you make an informed choice between the two?

In a nutshell:

  • LlamaIndex: offers a plethora of options for processing/chunking various document types and a rich array of retrieval possibilities.
  • LangChain: demonstrates more flexibility, providing extensive options for interacting with the LLM.

My personal opinion: experiment with both to figure out their strengths and align them with your specific needs. Furthermore, don’t hesitate to explore a hybrid approach, for instance by employing LlamaIndex for chunking and retrieval, and LangChain for LLM interaction.

Let’s move to the practical implementation and turn concepts into reality.

Setting the stage for offline RAG

Here’s a breakdown of what you’ll need:

  • an LLM: we’ve chosen two LLMs, namely TinyLlama1.1B and Zephyr-7B-Gemma-v0.1.
  • an embedding model: we will leverage sentence transformers, a Python framework for state-of-the-art sentence embeddings, with all-mpnet-base-v2 as the embedding model. We encourage readers to try other models from Hugging Face’s MTEB leaderboard (link).
  • a vector database: we will employ FAISS due to its ease of deployment and remarkable speed, making it an optimal choice for our purposes.

Let’s prepare the environment to run our offline RAG.

Installing dependencies

To kick off, ensure the following dependencies are installed. At the time we wrote this story, these were the latest versions of LlamaIndex and LangChain.

Please note that both LangChain and LlamaIndex evolve very quickly, and you may have to adapt the code below if you use a different version.

If you plan to process PDF documents, installing pypdf is essential. For other document types, additional packages may be necessary.
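The exact package list from the original snippet is not reproduced here; a plausible set for this walk-through, assuming a recent (0.10-style) LlamaIndex release that splits integrations into separate packages, would be installed from a notebook cell like this:

```
# Core libraries (pin versions if you need reproducibility)
!pip install llama-index langchain langchain-community sentence-transformers faiss-cpu transformers accelerate pypdf

# LlamaIndex integration packages used later in this walk-through
!pip install llama-index-vector-stores-faiss llama-index-embeddings-langchain llama-index-llms-huggingface
```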

Downloading the embedding model

To run our RAG system offline, we’ll load and save the embedding model locally.

Saving the embedding model locally
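A minimal sketch of this step with sentence-transformers; the local folder name is an assumption:

```python
from sentence_transformers import SentenceTransformer

# Download all-mpnet-base-v2 once while online, then save it to a local folder
model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
model.save("./all-mpnet-base-v2")
```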

Downloading the LLM

For both LLMs, we will save the model and the tokenizer locally. To speed up loading, we create shards of 1GB.

Downloading Zephyr
Downloading TinyLlama
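A sketch covering both downloads with transformers; the local folder names and the float16 dtype are assumptions, as are the exact Hugging Face checkpoints (here the chat variant of TinyLlama):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def download_model(model_id: str, local_dir: str) -> None:
    """Download a model and its tokenizer, then save them locally in 1GB shards."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)
    tokenizer.save_pretrained(local_dir)
    model.save_pretrained(local_dir, max_shard_size="1GB")

download_model("HuggingFaceH4/zephyr-7b-gemma-v0.1", "./zephyr-7b-gemma-v0.1")
download_model("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "./tinyllama-1.1b-chat")
```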

With these steps performed, you can disconnect from the Internet, immersing yourself in the world of offline RAG. Let’s continue.

Implementation of the RAG system

To create a RAG system, follow these steps:

Reading the documents

Initiate the process by using the document loader to load the content of your file for subsequent processing. Here we use a PDF file about quantum computing.

Load the PDF file
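A sketch using LlamaIndex’s SimpleDirectoryReader, which relies on pypdf for PDFs; the file name is a placeholder, and depending on your LlamaIndex version the import may come from llama_index rather than llama_index.core:

```python
from llama_index.core import SimpleDirectoryReader

# Load the PDF about quantum computing (file name is a placeholder)
documents = SimpleDirectoryReader(input_files=["./quantum_computing.pdf"]).load_data()
print(f"Loaded {len(documents)} document objects")
```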

Creating nodes

Node creation presents various strategies, and LlamaIndex offers a multitude of options. In this instance, we opt for the sentence splitter, defining chunk size and overlap. We encourage you to explore LlamaIndex’s documentation for parsers aligning with your specific needs.

Chunking the document
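A sketch with the SentenceSplitter; the chunk size and overlap values are illustrative:

```python
from llama_index.core.node_parser import SentenceSplitter

# Split the documents into nodes (chunks) with a modest overlap
parser = SentenceSplitter(chunk_size=512, chunk_overlap=64)
nodes = parser.get_nodes_from_documents(documents)
print(f"Created {len(nodes)} nodes")
```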

Extracting embeddings and storing them

In this step, we leverage sentence transformers to convert nodes into vectors and store them in the FAISS index. Note that LangChain is needed here to load the embedding model downloaded from Hugging Face.

Storing chunks into vector DB
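A sketch of the embedding and indexing step, assuming the locally saved all-mpnet-base-v2 model and the LlamaIndex FAISS and LangChain integration packages installed earlier:

```python
import faiss
from langchain_community.embeddings import HuggingFaceEmbeddings
from llama_index.core import Settings, StorageContext, VectorStoreIndex
from llama_index.embeddings.langchain import LangchainEmbedding
from llama_index.vector_stores.faiss import FaissVectorStore

# Wrap the locally saved sentence-transformers model via LangChain
Settings.embed_model = LangchainEmbedding(
    HuggingFaceEmbeddings(model_name="./all-mpnet-base-v2")
)

# all-mpnet-base-v2 produces 768-dimensional embeddings
faiss_index = faiss.IndexFlatL2(768)
vector_store = FaissVectorStore(faiss_index=faiss_index)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Embed the nodes and store them in the FAISS-backed index
index = VectorStoreIndex(nodes, storage_context=storage_context)
```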

Instantiating the LLM

Although this could have been done previously, it is now time to load Zephyr-7b-gemma-v0.1 and/or TinyLlama1.1B from our local storage. Only use the code snippet fitting your needs.

Instantiating Zephyr
Instantiating TinyLlama
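A sketch using LlamaIndex’s HuggingFaceLLM wrapper pointed at the locally saved weights; the paths and generation parameters are illustrative:

```python
import torch
from llama_index.llms.huggingface import HuggingFaceLLM

# Point model_name and tokenizer_name at the local folder created earlier.
# Swap "./zephyr-7b-gemma-v0.1" for "./tinyllama-1.1b-chat" to test the tiny model.
llm = HuggingFaceLLM(
    model_name="./zephyr-7b-gemma-v0.1",
    tokenizer_name="./zephyr-7b-gemma-v0.1",
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.1, "do_sample": True},
    model_kwargs={"torch_dtype": torch.float16},
    device_map="auto",
)
```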

Retrieving context and querying the LLM

Define a query engine to retrieve context pieces for a given question. By default, two chunks are retrieved and employed to formulate an answer.

Querying the LLM with pieces of context
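A minimal sketch; registering the LLM through Settings (or passing it directly to as_query_engine) is one way to wire it in:

```python
from llama_index.core import Settings

Settings.llm = llm

# By default, the query engine retrieves the two most similar nodes
query_engine = index.as_query_engine()
response = query_engine.query("What will quantum computing be capable of in 2040?")
print(response)
```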

There you have it: a simple yet foundational implementation of RAG with LlamaIndex.

Results: a first comparative analysis

For the experiment, we use a PDF document about quantum computing (link) and question the RAG system. Additional questions/answers are provided at the end of the article for the sake of readability.

Question: What will quantum computing be capable of in 2040?

Answers and response times (Google Colab) to “What will quantum computing be capable of in 2040?”

Before jumping to conclusions, it’s imperative to underscore that RAG implementation poses significant challenges, and various factors can impact its efficacy, as we will elaborate on in the subsequent section. Nevertheless, even with a simplistic implementation of RAG, several insights have emerged:

  1. Challenges with tiny LLMs: Despite being equipped with relevant context, tiny LLMs may struggle to provide effective responses to RAG queries.
  2. Performance of Zephyr-7B-gemma-v0.1: In contrast, Zephyr-7B-gemma-v0.1 demonstrated greater accuracy and precision in its responses.
  3. Testing potential of tiny and small LLMs: tiny and small LLMs can still be valuable for testing purposes due to their faster processing speed and lower VRAM requirements. However, only Zephyr-7B-gemma-v0.1 appears sufficiently performant for practical RAG applications.
  4. Considerations for response time: While response time varies with factors such as the maximum number of tokens to generate and query complexity, on average Zephyr-7B-Gemma-v0.1 generates responses 9 to 30 times more slowly than TinyLlama1.1B. It is therefore crucial to consider hardware capabilities to avoid prolonged user waiting times.

Why is it a naive implementation of RAG and how to improve it?

In the journey of mastering RAG, starting with a basic (naive) approach before refining it to achieve optimal performance is key. Let’s delve into some of the challenges encountered and explore potential solutions:

  • Limited chunk retrieval: By default, LlamaIndex retrieves only 2 chunks, which may suffice for a single document. However, when dealing with multiple documents or complex queries, this limitation proves suboptimal.
    Potential solution: Customize the query engine to adjust the top-k value for chunk retrieval (see the sketch after this list).
Custom query engine
  • Simplistic chunking approach: Defining the chunk size necessitates careful evaluation to strike a balance between providing enough context and avoiding overwhelming the LLM.
    Potential solution: Conduct hyperparameter tuning to optimize the chunk size.
  • Missing metadata: Querying multiple documents may result in a lack of global context, leading to the retrieval of chunks from irrelevant documents.
    Potential solution: Implement strategies such as utilizing a Summary Index or Metadata Replacement + Node Sentence Window to ensure efficient retrieval.
Summary Index
Metadata Replacement + Node Sentence Window
  • Fine-tuning prompt template: LlamaIndex employs a default prompt template for RAG, which may require refinement, especially when the context fails to provide relevant information.
    Potential solution: Refine the prompt template for improved performance (also sketched after this list).
Fine-tuning default prompt template
  • Irrelevant contexts retrieved: Not all retrieved context pieces are necessarily pertinent to the query, which can cause the LLM to struggle to discern the correct response.
    Potential solution: Employ reranking techniques to prioritize relevant context pieces.
Reranking
  • Lost in the middle: Retrieving too many chunks, even after reranking, can result in the most crucial information being buried in the middle.
    Potential solution: Utilize prompt compression techniques either alone or in conjunction with reranking.
Prompt compression
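As referenced above, here is a minimal sketch of two of these tweaks, namely raising the number of retrieved chunks and overriding the default question-answering prompt; the top-k value and the prompt wording are illustrative:

```python
from llama_index.core import PromptTemplate

# Retrieve more context pieces than the default of 2
query_engine = index.as_query_engine(similarity_top_k=5)

# Override the default text QA prompt used by the response synthesizer
qa_prompt = PromptTemplate(
    "Context information is below.\n"
    "---------------------\n"
    "{context_str}\n"
    "---------------------\n"
    "Using only the context above and no prior knowledge, answer the question. "
    "If the context is not relevant, say that you do not know.\n"
    "Question: {query_str}\n"
    "Answer: "
)
query_engine.update_prompts(
    {"response_synthesizer:text_qa_template": qa_prompt}
)
```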

By addressing these challenges and implementing the suggested solutions, you can refine your RAG implementation to achieve more accurate and efficient results, laying the groundwork for advanced implementation. You can also look into fine-tuning your embedding model or LLM but it should, in my opinion, be done at a later stage.

To conclude

Setting up a straightforward offline RAG with LlamaIndex proves to be a manageable endeavor. Throughout our experimentation, we gained valuable insights into the capabilities of tiny and small LLMs within the open-source realm. While both models are suitable for testing purposes without excessive hardware demands, our findings indicate that only Zephyr-7B-Gemma-v0.1 demonstrated sufficient performance for production applications, underscoring the potential need for larger models to address more intricate queries effectively.

Additionally, we’ve outlined various strategies for advancing RAG implementations and provided links to valuable resources for further exploration. We encourage readers interested in RAG to delve into these materials and explore additional sources of inspiration on platforms like Medium’s publications.

Lastly, it’s essential to acknowledge that there’s no one-size-fits-all approach to RAG implementation, given the diverse structures of documents and the multitude of use cases (querying documentation, reasoning on complex data…).

Wishing you success in implementing your offline RAG.

Source code

You can find the source code provided as a notebook here. For an offline implementation, adapt it with the code snippets provided above.

Attribution:

Vector icons created by Freepik — Flaticon

Additional questions/answers to the RAG system:

Question: What is quantum computing?

Answers and response times (Google Colab) to “What is quantum computing?”

Question: What are the ethical challenges related to quantum computing?

Answers and response times (Google Colab) to “What are the ethical challenges related to quantum computing?”
