LLM Engineering: Implementing RAG on LangChain

Kushal V
Published in TechHappily
7 min read · Jan 2, 2024

In this article, we delve into the fundamental steps of constructing a Retrieval Augmented Generation (RAG) on top of the LangChain framework. We will be using Llama 2.0 for this implementation, but feel free to use any open-source or proprietary LLM that you like.

Why is RAG needed?

The first question one might ask is: why do we need this paradigm when things are, for the most part, progressing well in the LLM world, with highly competitive open-source LLMs coming to the fore and an engineering ecosystem developing around them?

To answer that, we first need to define what RAG actually is.

At its core, the RAG paradigm adds a retrieval mechanism on top of a pre-trained LLM, allowing the model to access external knowledge sources and gain better context for answering queries.

To put it technically, we attach an external vector database to the LLM; the content retrieved from this database guides the model so that it can answer questions in a better-informed manner.

An analogy can be made with an open-book examination: a student is given access to reference books and materials in addition to the knowledge they gained by attending lectures and reading class notes. Similarly, in a RAG-based pipeline, the LLM receives external reference material (context) that helps it perform the task better.

The need for Retrieval Augmented Generation (RAG) arises from the fact that large language models rely solely on the knowledge encoded in their parameters at training time, generating answers from that internal knowledge alone.

For example, an LLM trained on data available up to 2021 can only give you information from that period, and it will tend to hallucinate when it cannot find the right (or closest matching) response. If a use case needs the latest developments in a specific subject, retraining the LLM from scratch is not feasible (it would cost millions of dollars!).

To summarize, here are the main challenges that large language models face despite being impressive at the majority of tasks.

  1. Temporal Constraints: LLMs trained on data up to a certain point, such as 2021, possess temporal limitations. They can provide information based only on the knowledge available up to that period, hindering their ability to offer insights into more recent developments.
  2. Risk of Hallucinations: LLMs are susceptible to generating inaccurate or speculative information, especially when confronted with queries outside their designated domains. This phenomenon, known as hallucination, poses a risk of producing misleading or unfounded responses.
  3. Dynamic Knowledge Requirements: Certain use cases demand real-time and up-to-date information, particularly in fields experiencing continuous advancements.

In essence, RAG models serve as a bridge between historical knowledge encoded during training and the evolving nature of real-world information. This paradigm ensures that the model not only retains its generative capabilities but also adapts to the dynamic and ever-changing landscape of knowledge.

Architecture of RAG Pipeline

In this section, we will focus on understanding the workings of the RAG pipeline and how it differs from a vanilla LLM pipeline.

Picture-1: Vanilla LLM Pipeline

A vanilla LLM pipeline is a linear process, as we can see in the image above. An input query is encoded by an embedding model and passed directly to the LLM, which answers from its internal knowledge alone. The limitations of this flow were discussed in the section above.

To implement the RAG paradigm, we have to add an external source of knowledge into this equation, which would mean adding a vector database.

The vector database holds the external knowledge sources that the user adds separately: a text file, PDF, HTML page, web page, plain string, and so on. This external data is then split into chunks, embedded, and stored in the VectorDB.

Picture-2: RAG Pipeline

The input query then hits the retriever, which semantically searches the VectorDB for the closest matching context. The prompt is then augmented with this additional context, guiding the LLM through question-answering.
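
To make the flow concrete, here is a minimal sketch in Python. The retriever and llm objects are hypothetical stand-ins for the components we build later in this article; this is an illustration only, not the final pipeline.

# Conceptual sketch of a RAG flow (hypothetical `retriever` and `llm` objects,
# built concretely later in the article).
def rag_answer(query: str) -> str:
    # 1. Semantic search: fetch the chunks closest to the query from the VectorDB
    docs = retriever.get_relevant_documents(query)
    context = "\n\n".join(doc.page_content for doc in docs)

    # 2. Augment the prompt with the retrieved context
    prompt = f"Answer using the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

    # 3. Generate an answer with the LLM
    return llm.invoke(prompt)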

Implementing RAG in LangChain

In our use case, we will give website sources to the retriever, which will act as an external source of knowledge for the LLM.

from langchain.llms import LlamaCpp
from langchain_core.output_parsers import StrOutputParser
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate

# Initiating the LLM
n_gpu_layers = 1 # Metal set to 1
n_batch = 512 # Should be between 1 and n_ctx, consider the amount of RAM of your Apple Silicon Chip.
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Make sure the model path is correct for your system!
model = LlamaCpp(
    model_path="./llama2/llama.cpp/models/7B/llama-2-7b-chat.Q5_0.gguf",
    n_gpu_layers=n_gpu_layers,
    n_batch=n_batch,
    n_ctx=2048,
    f16_kv=True,  # MUST be set to True, otherwise you will run into problems after a couple of calls
    callback_manager=callback_manager,
    verbose=True,
)

I have used the LlamaCpp abstraction to instantiate a quantized version of Llama-2-7b, setting all of its hyperparameters at initialization.
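
Before adding retrieval, a quick sanity check (my own addition, not required for the pipeline) confirms that the model loads and generates; the prompt is purely illustrative.

# Optional sanity check: the output streams to stdout via the callback handler.
print(model.invoke("Name the four chambers of the human heart."))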

To use websites as external sources, we need to install Playwright, a Python library for automating Chromium, Firefox, and WebKit browsers with a single API.

On your terminal, run the following commands:

pip install playwright
playwright install

Once Playwright is successfully installed, we move on to the most integral step in the process: loading the websites into the vector database. We will be using Chroma for our use case. (Feel free to use any vector database of your choice!)

We are using an embedding model from Hugging Face, sentence-transformers/all-mpnet-base-v2, to encode our external information.
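
As a small illustration of what the embedding step does (not part of the main pipeline), the same class can encode a single string; all-mpnet-base-v2 produces a 768-dimensional vector.

from langchain.embeddings.huggingface import HuggingFaceEmbeddings

# Illustrative only: embed one string and inspect the vector size.
embedder = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')
vector = embedder.embed_query("The human retina contains rods and cones.")
print(len(vector))  # 768 dimensions for all-mpnet-base-v2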

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_loaders import AsyncChromiumLoader
from langchain.document_transformers import Html2TextTransformer
from langchain.vectorstores import Chroma
import nest_asyncio

nest_asyncio.apply()
# Articles to index
articles = [
    "https://www.medicalnewstoday.com/human-biology/",
    "https://www.trivianerd.com/topic/human-body-trivia/",
    "https://www.watercoolertrivia.com/trivia-questions/anatomy-trivia-questions",
]

# Scrapes the blogs above
loader = AsyncChromiumLoader(articles)
docs = loader.load()

# Converts HTML to plain text
html2text = Html2TextTransformer()
docs_transformed = html2text.transform_documents(docs)

# Chunk text
text_splitter = CharacterTextSplitter(chunk_size=128,
                                      chunk_overlap=0)
chunked_documents = text_splitter.split_documents(docs_transformed)

# Load chunked documents into the Chroma vector store
db = Chroma.from_documents(
    chunked_documents,
    HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'),
)

retriever = db.as_retriever()

As you can see, we are building an LLM application with expertise in human anatomy and biology that is up to date with domain-specific trivia, which a general LLM might not have innate expertise in.
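
Before wiring the retriever into a chain, it is worth verifying that indexing worked by querying the retriever directly. This is an optional check of my own with an example query; the source metadata field is the URL recorded by the loader.

# Optional check: inspect what the retriever pulls back for a sample query.
sample_docs = retriever.get_relevant_documents("How many sweat glands does the average person have?")
for doc in sample_docs:
    print(doc.metadata.get("source"), "->", doc.page_content[:100])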

from langchain_core.runnables import RunnablePassthrough
from langchain.prompts import ChatPromptTemplate

prompt_template= """
### [INST]
Instruction: Answer the question based on your
human biology and anatomy knowledge. Here is context to help:

{context}

### QUESTION:
{question}

[/INST]
"""

# Abstraction of Prompt
prompt = ChatPromptTemplate.from_template(prompt_template)
output_parser = StrOutputParser()

# Creating an LLM Chain
llm_chain = LLMChain(llm=model, prompt=prompt)

# RAG Chain
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

The first step is to write a prompt template following the Llama instruction syntax, with two inputs: context and question. Context comes from the VectorDB, whereas the question comes from human input. Using the LangChain Expression Language (LCEL), we then create a RAG chain that combines the base model with the vector database.
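
One refinement worth noting, and an assumption on my part rather than something the chain above does, is to flatten the retrieved Document objects into a single string before they reach the prompt, so that {context} receives plain text instead of a list of documents. A sketch:

from langchain_core.runnables import RunnablePassthrough

# Hypothetical helper: join retrieved documents into one context string.
def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | llm_chain
)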

query_1 = "What are two different types of receptor cells present in human retina?"

rag_chain.invoke(query_1)

# Output:
# Based on my knowledge of human biology and anatomy, the answer to the question is:
# The two different types of receptor cells present in the human retina are:
# 1. Rod cells: These are sensitive to low light levels and are responsible for peripheral and night vision. They are found in the outer layers of the retina and are more numerous than cone cells.
# 2. Cone cells: These are sensitive to higher light levels and are responsible for color vision and central vision. There are three types of cone cells, each sensitive to a different range of light frequencies, allowing for color vision. They are found in the inner layers of the retina

query_2 = "Which part of the body is comprised of the carpus and metacarpus?"

rag_chain.invoke(query_2)

# Output:
# Based on my knowledge of human anatomy, the part of the body that is comprised of the carpus and metacarpus is the hand.
# The carpus is the wrist bone, and the metacarpus is the bones in the palm of the hand.

query_3 = "The average person has how many sweat glands?"

rag_chain.invoke(query_3)

# Output:
# According to the documents provided, the average person has approximately 2 million sweat glands.
# Document 1 states that the human body has 2.5 million sweat pores, but does not provide an exact number for the number of sweat glands. However, it is estimated that each sweat pore contains one to three sweat glands, so the total number of sweat glands can be estimated by multiplying the number of sweat pores by the number of sweat glands per pore.
# Document 2 provides an answer of 2 million sweat glands, which is supported by Document 3, which states that the average nose produces about a cupful of nasal mucus every day, which can be used to estimate the total surface area of the human body and therefore the number of sweat glands.
# Therefore, based on the information provided, the average person has approximately 2 million sweat glands.

As we can see in query_3, the model drew on the retrieved documents to give an informed response. On top of that, it provided well-detailed answers to query_1 and query_2.

This sums up our implementation of RAG on LangChain. Although this is an introductory treatment of the topic, our main interest was the implementation side of the RAG paradigm and how effective it can be when applications built around it target enterprise-level problems across various sectors.

Conclusion

In conclusion, LangChain makes it possible to implement a RAG-based LLM pipeline in just a few lines of code, opening up a different way to tackle problem statements across the AI industry.
