The RAG Workflow fRAGmented

Anuj Singhal
5 min read · Jun 19, 2024


Congratulations on your new job! It’s your first day and you are going through the onboarding process. You are provided with training materials and documents outlining the company’s policies, processes and guidelines. That’s a fair bit to read, and maybe you’re impatient and don’t feel like reading it all. That’s understandable…

But wait! Can’t you just ask a tool like ChatGPT to retrieve this information for you quickly? So you type in your question, but you get back something that sounds believable yet is nonsensical, known as a hallucination. Hmm, I guess these models can’t know everything after all…

You see, these documents are internal to the organisation and so may never have been exposed to these LLMs (Large Language Models) during their training process. More precisely, the information is simply not in their knowledge base, so when prompted about it they naturally make things up.

Enter RAG (Retrieval Augmented Generation), a way of augmenting the knowledge base of an LLM with external data. With it, organisations can build their own internal chatbots that save employees the pain of reading through heaps of documents to find what they need. In this article, I wish to take you through how this process works and provide you with a basic example using LangChain to see it in action!

The RAG process (Image by Author)

How do we add this knowledge to our LLM? The workflow shown in blue handles this:

1) The documents are split into chunks, and each chunk is converted into a vector called an embedding using an embedding model. Embeddings capture the meaning of the text.

2) The resulting vectors are stored in a vector database. Embeddings capture many features and are therefore very high-dimensional, which makes them expensive to store and search naively, so vector databases are purpose-built to handle this kind of data. (A minimal sketch of these two steps follows below.)
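To make these two steps concrete, here is a minimal sketch using the same embedding model we will set up later in the article; the chunk texts are purely illustrative:

# Hypothetical chunks taken from our internal documents
chunks = ["Employees must complete security training within 30 days.",
          "Expense claims are submitted through the finance portal."]

# 1) Convert each chunk into an embedding vector
from langchain_community.embeddings import HuggingFaceEmbeddings
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectors = embeddings.embed_documents(chunks)

# 2) Each vector is high-dimensional (384 dimensions for this model)
print(len(vectors), len(vectors[0]))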

We can now go ahead with querying for information using RAG shown by the steps in pink:

1) We convert the user query into an embedding using the same embedding model.

2–3) This vector is fed to the retriever, which performs a similarity search between the embedded query and the embeddings stored in the vector database. The similarity between two vectors is calculated using a metric of our choosing, such as cosine similarity or Euclidean or Manhattan distance (a toy example is sketched after these steps).

4–5) Based on the results of the similarity search, we extract the most relevant chunks from the vector database and attach them to the original query to provide context.

6–7) With this context included, we can feed the augmented prompt to the LLM and be more confident that it will give us a helpful response.
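To build some intuition for the similarity search in steps 2–3, here is a minimal sketch using plain NumPy and cosine similarity; the vectors are made up and far smaller than real embeddings:

import numpy as np

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths:
    # 1 means pointing in the same direction, 0 means unrelated
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vec = np.array([0.2, 0.8, 0.1])       # embedded user query
doc_vecs = np.array([[0.1, 0.9, 0.0],       # embedded document chunks
                     [0.9, 0.1, 0.3],
                     [0.3, 0.7, 0.2]])

# Score every stored chunk against the query and pick the closest one
scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
best = int(np.argmax(scores))
print(f"Most relevant chunk: {best} (score {scores[best]:.3f})")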

Want to see it in action? I’ve got you. LangChain is a framework for building applications with LLMs, and we will use it here to demonstrate the RAG process with a PDF file as our document.

I have used Python 3.11 and created a virtual environment to install the following packages:

pip install huggingface-hub
pip install langchain_community
pip install faiss-cpu
pip install sentence-transformers

Now we can import the necessary libraries to set up vector storage, load the PDF file, and use the LLM and embedding models.

from langchain_community.llms import HuggingFaceEndpoint
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain

Next we can load the PDF and split it into pages. We can also instantiate our LLM (Meta’s Llama 3 8B Instruct) and an embedding model. Here I will use Hugging Face, an amazing AI community providing open-source models and datasets. To use a Hugging Face LLM with LangChain, you need to create an API token, which is free to do once you have created a Hugging Face account. Make sure you don’t reveal this token publicly!

# Load the PDF and split it into one Document per page
loader = PyPDFLoader('sample.pdf')
pages = loader.load_and_split()

# Embedding model that converts text into vectors
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# LLM hosted on Hugging Face (keep your API token secret!)
llm_model = HuggingFaceEndpoint(repo_id='meta-llama/Meta-Llama-3-8B-Instruct',
                                huggingfacehub_api_token='**********',
                                temperature=0.1)

# Let's view an example of a page
print(type(pages[4]))
print(pages[4].page_content)

We see that it is a Document object that describes the contents of the 5th page (since indexing in Python starts with 0).

It contains hardware and software requirements for a client project. Let’s try to get this info from the LLM with an appropriate prompt. But first, you may wonder: what happens when we use the LLM on its own, without RAG?
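Here is roughly how you could do that, using the llm_model we instantiated above (a sketch; no retrieved context is included in the prompt):

query = "What are the hardware and software requirements for the client?"
print(llm_model.invoke(query))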

Trying to get the client requirements from the LLM alone

A hallucination. The LLM generates something that sounds plausible at first, but later in the response the text becomes nonsensical: the document makes no mention of a ‘DirectX 9.0c’ graphics card or anything about Java. Not to worry, we can fix this…

Next we can go ahead and pass these pages into the embedding model and store the resulting vectors in the vector database. For demonstration purposes I am using FAISS (Facebook AI Similarity Search).

We also set up a chain that combines our vector store and LLM, so that the relevant information is retrieved and used to answer the question.

vectorstore = FAISS.from_documents(pages, embeddings)  # embed the pages and store them in a FAISS index
chain = ConversationalRetrievalChain.from_llm(llm=llm_model,
                                              retriever=vectorstore.as_retriever())

You ready? We can invoke the chain by passing in our original prompt…

query = "What are the hardware and software requirements for the client?"
response = chain.invoke({'question': query, 'chat_history': []})
print(response['answer'])

Voila! We got the requirements we wanted! It truly is a simple but effective process. We did not have to fine-tune the LLM on this data, which can be a computationally expensive task given how large these models are.
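Because the chain is conversational, you can also pass the previous exchange as chat history and ask a follow-up question. A quick sketch (the follow-up text is just illustrative):

# Pass the earlier question and answer as chat history
chat_history = [(query, response['answer'])]
follow_up = "Which of those requirements relate to the operating system?"
response2 = chain.invoke({'question': follow_up, 'chat_history': chat_history})
print(response2['answer'])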

So by using RAG workflows you can get up-to-date, organisation-specific answers without having to wade through pages of onboarding documents. You’re now ready to get to work!
