A politically incorrect, offensive, and annoyed HR assistant — (learning to build a RAG application)

Chintana Wilamuna
5 min read · Jun 7, 2024


Retrieval Augmented Generation (RAG for short) is a technique for building AI applications with LLMs (large language models) that can use a body of knowledge that sits outside of, and was not part of, the data used to train the LLM. This is increasingly becoming the norm because, without RAG, most enterprise use cases are very limited, IMO.

I’m new to AI and LLMs and there are so many interesting theories, techniques, challenges and applications to learn. I wanted to figure out the mechanics first by coding through a problem.

The use case

At WSO2 (just like at every other company) there is a set of documented processes. I chose the HR-related policies and processes. These docs are mostly either Google Docs or PDFs. While processing plain text is relatively easy, I wanted to see how accurate and easy it would be to process a bunch of PDFs.

So I thought it’d be fun to create an HR assistant that understands these docs and answers questions.

When dealing with internal docs, it’s not very wise to use a bunch of cloud services, so I wanted to work with them completely locally.

After a couple of days of tutorials and examples, I had a working system simple enough to wrap my head around and understand the principles of how a typical application works. My setup looks like below,

Generic steps to build a RAG app (simplified)

These are the generic, high-level steps involved in building a RAG app. Please note that this is very basic and doesn’t have a system to compare answers against to measure how accurate they are. There are a bunch of things I probably don’t know about at this point for a real production RAG app. On with the show,

  1. Load all your docs — meaning extract text data
  2. Split the docs into chunks
  3. Create vector embeddings of the chunks
  4. Save them to a vector database
  5. Get user query and calculate the vector embeddings for the query
  6. From the vector database, get similar chunks that closely match the query
  7. Provide matching chunks as context and get the LLM to answer the query based on the context

Elaborating more on the above points

Loading docs

LangChain provides a stellar API to process different kinds of documents with just a couple of API calls! LangChain can process PDFs, text, HTML, Microsoft Office docs etc. and extract the text data. I saved all the PDFs into a directory and gave a glob pattern to load everything,

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import PyMuPDFLoader

DOCS_PATH = "policies/"

# Load every PDF in the directory and extract its text with PyMuPDF
loader = DirectoryLoader(DOCS_PATH, glob="*.pdf", loader_cls=PyMuPDFLoader)
documents = loader.load()

Splitting docs into chunks

I’m using a chunk size of 1000 for the time being and it seems to work alright. I still haven’t figured out how this affects the accuracy of the responses it generates.
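As a rough sketch, this step can look like the following with LangChain’s RecursiveCharacterTextSplitter (the overlap value is just an illustrative choice, not something I have tuned),

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the loaded documents into ~1000 character chunks
# chunk_overlap here is an illustrative value, not a tuned one
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
)
chunks = text_splitter.split_documents(documents)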

Vector embeddings

Vector embeddings are fascinating things. As humans, when we read two sentences, we can say whether they are closely related or not. A vector embedding of a sentence is a numerical representation that can be used to calculate its similarity to other sentences.
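As a tiny illustration (the two sentences below are made up, and this isn’t part of the main setup), you can embed two sentences with the local llama3 model via Ollama, which the rest of the setup uses too, and compare them directly,

import numpy as np
from langchain_community.embeddings import OllamaEmbeddings

embedding_fn = OllamaEmbeddings(model="llama3", base_url="http://localhost:11434/")

# Embed two made-up sentences and compare them with cosine similarity
a, b = embedding_fn.embed_documents([
    "How many days of annual leave do I get?",
    "What is the company's vacation policy?",
])
a, b = np.array(a), np.array(b)
similarity = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"Cosine similarity: {similarity:.3f}")  # closer to 1 means more similar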

There are vector databases that can store vector embeddings of sentences, or chunks. In this case I’m using Milvus.

LangChain provides a convenient API to calculate embeddings based on the model that’s going to be used for the application. In my case, I chose to run everything locally, so I’m using Ollama with the llama3 model.

from langchain_community.embeddings import OllamaEmbeddings
from langchain_milvus.vectorstores import Milvus

# Embeddings are generated by the local llama3 model served by Ollama
embedding_fn = OllamaEmbeddings(
    model="llama3",
    base_url="http://localhost:11434/"
)

# Embed the chunks and store them in the policy_docs collection in Milvus
Milvus.from_documents(
    documents=chunks,
    embedding=embedding_fn,
    collection_name="policy_docs",
    connection_args={"address": "localhost:19530"})

Save embeddings to DB

After this step, all of the vector embeddings are in the database. Milvus has a nice GUI to explore what’s stored in the DB. It’s easy to run Milvus and Attu (the GUI) as docker containers. This is the path I took.
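If you’d rather check from code instead of the GUI, a small sketch using pymilvus directly (assuming the default local connection settings) looks like this,

from pymilvus import Collection, connections

# Connect to the local Milvus instance and count what landed in the collection
connections.connect(host="127.0.0.1", port="19530")
collection = Collection("policy_docs")
print("Stored chunks:", collection.num_entities)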

Get similar chunks to the user query

When retrieving similar chunks, we have to provide the same embedding function we used to persist the data. This creates a connection to the DB and gets a retriever object.

embedding_fn = OllamaEmbeddings(
    model="llama3",
    base_url="http://localhost:11434/"
)

# Connect to the existing policy_docs collection using the same embedding function
db = Milvus(
    embedding_fn,
    connection_args={"host": "127.0.0.1", "port": "19530"},
    collection_name="policy_docs",
)
retriever = db.as_retriever()
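To sanity-check retrieval on its own (the query string here is just a made-up example),

# Ask for chunks similar to a made-up question and peek at what comes back
docs = retriever.invoke("How many days of annual leave do I get?")
for doc in docs:
    print(doc.metadata.get("source"), doc.page_content[:100])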

Prompt template

The next step is to pass the context and question to a prompt template. This is where you would create your prompt and have placeholders for the variables.
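The actual prompt isn’t shown here, but purely as an illustration, a template could look something like this,

# Illustrative template only; the real prompt used here is different
PROMPT_TEMPLATE = """You are an HR assistant. Answer the question using only
the context below. If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
"""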

from langchain_core.prompts import PromptTemplate

# PROMPT_TEMPLATE is a string with {context} and {question} placeholders
prompt = PromptTemplate(
    template=PROMPT_TEMPLATE,
    input_variables=["context", "question"]
)

The RAG chain

Lastly we come to the bit where all the magic happens. This is where the actual similarity search happens at runtime, the variables in the prompt template are substituted, and the result is passed to the LLM to generate a response.

from langchain_community.llms import Ollama
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# The same local llama3 model generates the answers
llm = Ollama(model="llama3", base_url="http://localhost:11434/")

# Retrieve matching chunks, fill in the prompt and send it to the LLM
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# query_text is the user's question as a plain string.
# Stream the output to standard out (terminal)
for chunk in rag_chain.stream(query_text):
    print(chunk, end="", flush=True)

The results

With my super scientific testing (read: a couple of queries) I can see that the responses fairly hit the mark.

Personality: With some prompting, I made the responses sound like something an HR person would never say ;-)

Missing from the output above is the metadata, or the actual source document. You can also print the source documents as references, which is useful in a real-world application.
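One rough way to do that with the pieces above (a sketch, not the code I actually ran) is to run retrieval separately so the sources can be printed next to the answer,

# Sketch: retrieve once, answer from those chunks, then list their sources
docs = retriever.invoke(query_text)
answer = (prompt | llm | StrOutputParser()).invoke(
    {"context": docs, "question": query_text}
)
print(answer)
for doc in docs:
    print("Source:", doc.metadata.get("source", "unknown"))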

Just for giggles

llama3, from my limited prompting, is very conservative about generating profanity. Just for giggles, I used https://www.ollama.com/library/everythinglm and was able to generate some hilarious responses!
