Exploring offline RAG with Langchain, Zephyr-7b-beta and DeciLM-7b

Jeremy K
𝐀𝐈 𝐦𝐨𝐧𝐤𝐬.𝐢𝐨
12 min read · Dec 26, 2023

In 2023, large language models (LLMs) have taken center stage, with Retrieval Augmented Generation (RAG) becoming the buzzword of recent months. For those unfamiliar, it holds the key to transcending the limitations of LLMs, ushering in new and exciting possibilities.

While there is a plethora of insightful content on mastering RAG with OpenAI available on platforms such as Medium, this article caters to those for whom confidentiality is a top priority. If you’re intrigued by RAG and prefer to work offline, this piece is tailored just for you.

In the upcoming sections, we’ll delve into the intricacies of RAG, its operational mechanisms, and provide you with a foundational implementation using Langchain. This starting point can be further developed and refined to create your own robust and powerful RAG application.

RAG: definition and significance

Limitations of LLMs

Querying ChatGPT about what RAG is yields this response:

As of my last knowledge update in January 2022, I don’t have specific information on a term or concept called “Retrieval Augmented Generation” (RAG) in the context of a well-established and widely recognized field or technology.

This, in essence, underscores the significance of RAG.

To understand its importance, let’s first examine how LLMs operate.

How LLMs work

The user inputs a query, which the LLM processes to generate the best possible response. However, this approach comes with its set of limitations:

  • LLMs lack up-to-date information, as evident from ChatGPT’s training data cutoff of January 2022.
  • Your private data, be it corporate or personal, is not within the LLM’s scope: queries about data unseen during training yield no results.
  • LLMs tend to hallucinate: they prioritize providing an answer, even if that means inventing information.

Why RAG matters

This is where RAG steps in, offering a way to overcome these constraints. RAG follows a distinct process, providing the LLM with context in addition to the query:

RAG

In simple terms, RAG enables you to extract pertinent information from your documents, feed it to the LLM, and obtain answers about information the LLM has never seen.

How RAG works

RAG’s operational framework is a bit more intricate than previously outlined, involving specific steps:

RAG process

Chunking

The initial step is to break down documents into smaller, manageable pieces that can be stored and indexed. This process is crucial for several reasons:

  • Context window: LLMs operate within a defined context window, and sending extensive documents may exceed this limit, resulting in no response.
  • Processing time: providing a large volume of documents to an LLM increases processing time. Supplying relevant, smaller chunks optimizes processing.
  • Better answers: restricting input to pertinent information enhances the quality of responses generated by the LLM.

Chunking can be approached in various ways:

  • Creating chunks of predefined sizes (e.g., based on a certain number of characters or tokens).
  • Generating new chunks at every new sentence or paragraph.
  • Designing a custom strategy tailored to specific document characteristics.

The optimal approach depends on the specific case, as different strategies yield diverse results. To preserve information during chunking, incorporating a chunk overlap is crucial. This ensures continuity in meaning between adjacent chunks, preventing loss of context.
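To make the effect of overlap concrete, here is a minimal, self-contained sketch using Langchain’s RecursiveCharacterTextSplitter (the same splitter used later in the implementation); the text and chunk sizes are arbitrary and chosen only for illustration:

from langchain.text_splitter import RecursiveCharacterTextSplitter

#Toy example with tiny chunks so the overlap is easy to see
sample_text = (
    "Quantum computing uses qubits instead of bits. "
    "Qubits can represent 0 and 1 at the same time. "
    "This property enables new classes of algorithms."
)
splitter = RecursiveCharacterTextSplitter(chunk_size=80, chunk_overlap=20)
for i, chunk in enumerate(splitter.split_text(sample_text)):
    print(f"Chunk {i}: {chunk}")
#With a non-zero overlap, the end of one chunk typically reappears
#at the start of the next, preserving continuity between chunks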

For a closer look at choosing chunk size and overlap, I recommend reading this article.

Storing and indexing embeddings

Following chunking, the next step is to compute embeddings for each chunk and store them in a vector database. These embeddings will serve as the foundation for future searches.

Retrieval

When a user poses a question, the relevant context pieces are retrieved. This is achieved through the following process:

  • Extracting embeddings of the query.
  • Searching the vector database using these embeddings to retrieve the top-k chunks, with the option to apply a relevance threshold.

With the selected context pieces, the question can be directed to the LLM.

Asking the LLM

Querying the LLM involves instructing it on how to behave and supplying the context pieces needed to formulate an answer. The prompt typically resembles:

Use the following pieces of context to answer the question at the end.
If you don’t know the answer, just say that you don’t know,
don’t try to make up an answer.
{context}
Question: {question}
Helpful Answer:

This approach aims to ensure a precise and relevant response, minimizing speculative answers.

Enough with the theory; let’s now explore a tangible implementation.

Implementation of an offline RAG

Material required

If you’ve followed the previous sections, you are aware that implementing offline RAG necessitates a few essential components:

  1. a Large Language Model: we will utilize Zephyr-7b-beta and DeciLM-7b (you only need one, but we test both to assess their performance), chosen for their relative compactness (7 billion parameters) and their current superiority over some larger models. Notably, we’ll store these models locally, ensuring our code can run entirely offline.
  2. a vector database: we will employ FAISS due to its ease of deployment and remarkable speed, making it an optimal choice for our purposes.
  3. an embedding model: we will leverage the sentence-transformers model all-mpnet-base-v2, currently recognized as one of the top-performing embedding models in the field. Similar to the LLM, we’ll store this model locally to facilitate offline code execution.
  4. a library to interact with the LLM: we will opt for Langchain, even though there is an ongoing debate between Langchain and LlamaIndex. It’s essential to note that each serves a distinct purpose, and for this specific application, Langchain is an ideal choice.

Getting Started

Now that we have our materials in place, let’s dive into the implementation.

  • Installing dependencies

The following packages are required to set up a RAG application. Please note that Langchain evolves very fast, and some pieces of code may need to be adapted for older or newer versions.

pip install langchain~=0.0.352
pip install pypdf
pip install sentence-transformers==2.2.2
pip install huggingface_hub
pip install accelerate
pip install torch~=2.1.2
pip install transformers~=4.36.2

For FAISS, install the version that fits your hardware:

pip install faiss-gpu
#pip install faiss-cpu
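If you are unsure which build fits your machine, a quick check of CUDA availability can help you decide (a minimal sketch, assuming torch is already installed):

import torch

#If a CUDA device is visible, faiss-gpu can be used; otherwise install faiss-cpu
print("CUDA available:", torch.cuda.is_available())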
  • Loading modules

#Langchain modules
from langchain import document_loaders as dl
from langchain import embeddings
from langchain import text_splitter as ts
from langchain import vectorstores as vs
from langchain.llms import HuggingFacePipeline
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.schema import StrOutputParser
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.runnable import RunnableParallel
from langchain.prompts import PromptTemplate
from operator import itemgetter
#Torch + transformers
import torch
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForCausalLM
#Other useful modules
import re
import time

  • Load a document and chunk it

#A document about quantum computing
document_path = "quantum-mckinsey.pdf"

#We set a default chunk size of 500 characters with an overlap of 20 characters
def split_doc(document_path, chunk_size=500, chunk_overlap=20):
    loader = dl.PyPDFLoader(document_path)
    document = loader.load()
    text_splitter = ts.RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
    document_splitted = text_splitter.split_documents(documents=document)
    return document_splitted

#Split the document and print the different chunks
document_splitted = split_doc(document_path)
for doc in document_splitted:
    print(doc)

  • Load the embedding model

If you want to save it locally first, you can do as follows:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")
#Save the model locally
model.save('sentence-transformers')
#release memory (RAM + cache)
del model
torch.cuda.empty_cache()

Then, you can load it from your local folder:

def load_embedding_model():
    model_kwargs = {'device': 'cuda:0'}
    encode_kwargs = {'normalize_embeddings': False}
    embedding_model_instance = embeddings.HuggingFaceEmbeddings(
        #Folder name where the model was stored
        model_name="sentence-transformers",
        model_kwargs=model_kwargs,
        encode_kwargs=encode_kwargs
    )
    return embedding_model_instance

#Instantiate the embedding model
embedding_model_instance = load_embedding_model()
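Before moving on, a quick sanity check (not part of the pipeline itself, just a suggestion) confirms the model loads from the local folder and produces embeddings of the expected dimension, which is 768 for all-mpnet-base-v2:

#Embed a short query and inspect the resulting vector
sample_vector = embedding_model_instance.embed_query("What is quantum computing?")
print(len(sample_vector))  #expected: 768 for all-mpnet-base-v2
print(sample_vector[:5])   #first few components of the embedding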
  • Creating a vector database and storing the chunk embeddings

def create_db(document_splitted, embedding_model_instance):
    model_vectorstore = vs.FAISS
    db = None
    try:
        content = []
        metadata = []
        for d in document_splitted:
            content.append(d.page_content)
            metadata.append({'source': d.metadata})
        db = model_vectorstore.from_texts(content, embedding_model_instance, metadatas=metadata)
    except Exception as error:
        print(error)
    return db

db = create_db(document_splitted, embedding_model_instance)
#Store the db locally for future use
db.save_local('db.index')
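Because the index is saved locally, it can be reloaded in a later session without re-embedding the document. A minimal sketch, assuming the same embedding model instance is available:

#Reload the index from disk in a later session (no re-embedding needed)
db = vs.FAISS.load_local('db.index', embedding_model_instance)
#Note: newer Langchain versions may require allow_dangerous_deserialization=True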
  • Load the large language model

To save the model locally:

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceH4/zephyr-7b-beta", low_cpu_mem_usage=True, torch_dtype=torch.float16)
model.save_pretrained('zephyr-7b-beta-model', max_shard_size="1000MB")
tokenizer.save_pretrained('zephyr-7b-beta-tokenizer')
del model
del tokenizer
torch.cuda.empty_cache()

Then, to load it from the local folder:

tokenizer = AutoTokenizer.from_pretrained("zephyr-7b-beta-tokenizer")
model = AutoModelForCausalLM.from_pretrained("zephyr-7b-beta-model", low_cpu_mem_usage=True, torch_dtype=torch.float16)
pipe = pipeline(task="text-generation", model=model, tokenizer=tokenizer, device="cuda:0", max_new_tokens=1000)
llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})

The temperature is set to 0 since creativity is not required in RAG; our goal is to obtain an answer aligned with the content of the provided documents.
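To compare with DeciLM-7b, the same save-and-load pattern applies. The sketch below is an illustration under assumptions: it uses the Deci/DeciLM-7B checkpoint from the Hugging Face Hub and passes trust_remote_code=True because the model relies on custom modelling code (check the model card for your version):

#Same pattern as above, adapted for DeciLM-7b
tokenizer = AutoTokenizer.from_pretrained("Deci/DeciLM-7B")
model = AutoModelForCausalLM.from_pretrained("Deci/DeciLM-7B", low_cpu_mem_usage=True, torch_dtype=torch.float16, trust_remote_code=True)
model.save_pretrained('decilm-7b-model', max_shard_size="1000MB")
tokenizer.save_pretrained('decilm-7b-tokenizer')
del model
del tokenizer
torch.cuda.empty_cache()
#Loading from the local folders then mirrors the Zephyr example above,
#keeping trust_remote_code=True when reloading the model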

  • Retrieving pieces of context

After defining a query, we retrieve the top-6 pieces of context from the vector database. Depending on the use case, it may prove useful to retrieve a different number of chunks and adjust the relevance threshold.

query = "What is quantum computing?"
retriever = db.as_retriever(search_type="similarity_score_threshold", search_kwargs={"k": 6, 'score_threshold': 0.01})
retrieved_docs = retriever.get_relevant_documents(query)
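To pick a sensible score_threshold, it can help to inspect the relevance scores directly on the FAISS store. The sketch below uses the relevance scoring the threshold retriever relies on (higher means more relevant); the value of k is only an example:

#Inspect relevance scores to help choose score_threshold
docs_and_scores = db.similarity_search_with_relevance_scores(query, k=6)
for doc, score in docs_and_scores:
    print(round(score, 3), doc.page_content[:80])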
  • Defining a prompt template

The prompt template is what will be used to instruct the LLM and fit the context into the query.

template = """Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
{context}
Question: {question}
Helpful Answer:"""
rag_prompt_custom = PromptTemplate.from_template(template)

  • Creating chains to perform RAG

The first chain inserts the context into the prompt template and runs the query through the LLM. The second chain adds the source metadata to the answer.

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

#First chain to query the LLM
rag_chain_from_docs = (
    {
        "context": lambda input: format_docs(input["documents"]),
        "question": itemgetter("question"),
    }
    | rag_prompt_custom
    | llm
    | StrOutputParser()
)

#Second chain to postprocess the answer
rag_chain_with_source = RunnableParallel(
    {"documents": retriever, "question": RunnablePassthrough()}
) | {
    "documents": lambda input: [doc.metadata for doc in input["documents"]],
    "answer": rag_chain_from_docs,
}

  • Printing the answer

t0 = time.time()
resp = rag_chain_with_source.invoke(query)
if len(resp['documents']) == 0:
    print('No documents found')
else:
    stripped_resp = re.sub(r"\n+$", " ", resp['answer'])
    print(stripped_resp)
    print('Sources', resp['documents'])
print('Response time:', time.time() - t0)

Results: a comparative analysis

Upon employing two distinct LLMs and utilizing a document focused on quantum computing (link), the outcomes for specific questions are detailed below:

  • Zephyr-7b-beta

Question: What is quantum computing?

Answer: Quantum computing is the application of quantum mechanics to computational problems. It uses quantum bits (qubits) as its most basic unit of information, which can assume values that are a combination of both 0 and 1. This characteristic of quantum physics enables new computing algorithms that can massively compress computation time. Quantum computing was first proposed in 1980, and in the past few years, quantum uplift has been observed in certain use cases. Most known use cases fit into four archetypes: quantum simulation, quantum linear algebra for AI and machine learning, quantum optimization and search, and quantum factorization. Quantum computing may be especially valuable in natural language processing, as it can accurately extract meaning from large volumes of text. However, quantum hardware is still immature compared to conventional high-performance computers, and a quantum algorithm that provides a significant quantum speedup may not yet have a practical quantum advantage.

Question: Why should companies invest in quantum computing?

Answer: According to a recent report by McKinsey, four industries — pharmaceuticals, chemicals, automotive, and finance — could see a value at stake of between $300 billion and $700 billion from short-term benefits of quantum computing. The report suggests that quantum computing could make R&D in pharmaceuticals dramatically faster and more targeted and precise by making target identification, drug design, and toxicity testing less dependent on trial and error and therefore more efficient. The report also highlights that quantum computing could potentially generate half that value in the chemicals industry. The report notes that the estimates are speculative and intended to guide research towards areas of quantum applications with a high value potential, rather than to serve as definitive projections for business value. However, the report suggests that companies should invest in quantum computing due to the potential short-term benefits in these industries.

Question: What should be the way forward for companies that want to invest in quantum computing?

Answer: Companies that want to invest in quantum computing should follow industry developments, screen quantum-computing use cases with an in-house team of quantum-computing experts or collaborate with industry entities, build digital infrastructure that can meet the basic operating demands of quantum computing, make relevant data available in digital databases, and set up conventional computing workflows to be quantum ready once more powerful quantum hardware becomes available. They should also prepare by
integrating the best available quantum hardware into their services and facilitating the execution of hybrid quantum-conventional workflows, as quantum technology will effectively be a coprocessor to conventional
computing infrastructure. Beyond 2030, intense ongoing research by private companies and public institutions will remain vital to improve quantum hardware and enable more — and more complex — use cases.

Question: Who won the world cup?

Answer: no documents found.

  • DeciLM-7b

Even though “DeciLM-7B is the top-performing 7B base language model on the Open LLM Leaderboard”, it produced many hallucinations in its answers, which are deliberately omitted below for clarity.

Question: What is quantum computing?

Answer: Quantum computing is a novel technology that can help businesses solve problems that are beyond the reach of conventional computing. Introduction.
A fast-developing ecosystem, increasing investment, and accelerating research breakthroughs in quantum computing signal it’s time for executives to consider the technology’s business implications. Quantum computing is a novel technology that can help businesses solve problems that are beyond the reach of conventional computing.

Question: Why should companies invest in quantum computing?

Answer: Collectively (and conservatively), the value at stake for these industries could be between roughly $300 billion and $700 billion.

[hallucinations repeated several times]

Question: What should be the way forward for companies that want to invest in quantum computing?

Answer: 5. Prepare by building digital infrastructure that can meet the basic operating demands of quantum computing; make relevant data available in digital databases and set up conventional computing workflows to be quantum ready once more powerful quantum hardware becomes available.

[hallucinations repeated several times]

Question: Who won the world cup?

Answer: no documents found.

As illustrated above, Zephyr-7b-beta demonstrated proficiency in providing accurate answers to the posed questions, effectively extracting the necessary information from the chunks retrieved from the document. It is crucial to acknowledge that not all LLMs perform equally well at RAG: DeciLM-7b, for instance, displayed a tendency toward hallucination and, overall, produced responses of comparatively lower quality. This discrepancy underscores the importance of testing and selecting the right LLM for optimal RAG outcomes. Lastly, the question unrelated to the document returned “No documents found”, since no chunk passed the relevance threshold.

Possible improvements and way forward

While the current implementation works effectively, there are several avenues for potential enhancements. Here’s a non-exhaustive list of actions to consider for improving results:

  1. Benchmarking chunk size and overlap: experiment with different chunk sizes and overlaps to determine the optimal values for your specific use case.
  2. Evaluating different chunking strategies: explore various chunking strategies, such as one sentence, one paragraph, or a specific number of tokens, to identify the strategy that yields the best results.
  3. Trying different LLMs and text embedding models: test alternative LLMs and text embedding models to assess their impact on performance and accuracy.
  4. Setting a relevance threshold for retrieval: fine-tune the relevance threshold for document retrieval based on the complexity of the query. Adjust the number of chunks to retrieve accordingly.
  5. Refining the prompt template: improve the prompt template to better handle hallucinations or speculative responses from the language model.
  6. Adding conversation memory in Langchain: consider integrating conversation memory to enable follow-up questions and maintain context over multiple interactions (a minimal sketch follows this list).
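As mentioned in point 6, here is a minimal sketch of conversation memory, assuming the ConversationalRetrievalChain API available in this Langchain version (the exact API may differ in newer releases):

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory

#Memory object that stores the dialogue history between turns
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)
conversational_chain = ConversationalRetrievalChain.from_llm(
    llm=llm,
    retriever=retriever,
    memory=memory
)
#First question, then a follow-up that relies on the stored history
print(conversational_chain({"question": "What is quantum computing?"})["answer"])
print(conversational_chain({"question": "What are its main use cases?"})["answer"])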

Additional Recommendations:

  • Sign up for a training course: enhance your skills and understanding by enrolling in a training course; numerous free online courses are available to further elevate your capabilities.
  • Read the Langchain documentation: familiarize yourself with the Langchain documentation to gain deeper insights into its features and capabilities.
  • Explore the possibilities offered by LlamaIndex: investigate the possibilities offered by LlamaIndex and understand how to seamlessly integrate it with Langchain. Exploring complementary tools can expand the capabilities of your RAG system.

These recommendations aim to empower you with the tools and knowledge needed to continually refine and optimize your offline RAG implementation.

Conclusion

In our exploration, we’ve highlighted the significance of RAG and demonstrated its execution in an offline setting. While you are now equipped with the initial code for implementation, the true potential of RAG lies in your hands. It’s an invitation to dive in, experiment, fine-tune parameters and play with LLMs in order to unleash its full power. As you navigate through the code, discovering the optimal configurations, you embark on a journey to unlock the best it has to offer.
