Building Basic RAG — Runbook

Sarang Sanjay Kulkarni
7 min read · May 1, 2024


This is the third article in my series on Retrieval Augmented Generation. Below are links to the earlier parts; if you are new to building applications with LLMs and RAG, I recommend reading these articles in sequence.

  1. Knowledge-Augmentation Methods for Building LLM Applications
  2. Gentle Introduction to Retrieval Augmented Generation

In the previous articles, we built a theoretical foundation for knowledge-augmentation methods for LLMs, then dived deeper into Retrieval Augmented Generation, how it works, and some real-life applications of RAG.

In this article, we will get our hands dirty and build a simple RAG application.

Note: This article assumes that you have read the previous introductory article on RAG.

Introduction

In this article, let's build a simple RAG chatbot that can answer questions based on some text files. I recommend following along with the code. You can find the Jupyter notebook for this article here.

By the end of this article, you will have a good overview of how to build a simple RAG chatbot using LangChain and Python.

LangChain is a framework for developing applications powered by large language models (LLMs). If you are new to LangChain, head over to its quickstart guide to get just enough understanding.

Setup

We will be using OpenAI models, so we will also need an OPENAI_API_KEY. Here is how you can get one.

For this exercise, we will use a couple of text files for demonstration purposes. Those files are available in the data folder of the same repository.

Let's start!

Install Necessary Libraries:

!pip install langchain langchain_community langchain_openai chromadb langchainhub

Set up the necessary environment variables:

import os
os.environ['OPENAI_MODEL_NAME'] = 'gpt-4-1106-preview'      # model we intend to use
os.environ['OPENAI_API_KEY'] = 'sk-XXXXXXXXXXX'              # your OpenAI API key
os.environ['OPENAI_API_BASE'] = 'https://api.openai.com/v1'  # default OpenAI endpoint; only override this if you use a proxy or alternative endpoint

A better way to manage environment variables:

Keep these environment variables in a .env file for better management. You can use the python-dotenv library to load the .env file.
The .env file should be in the root directory of the project. Here is an example .env file:

OPENAI_MODEL_NAME=gpt-4-1106-preview
OPENAI_API_KEY=sk-XXXXXXXXXXX
OPENAI_API_BASE=https://api.openai.com/v1

Load these environment variables using the following code:

from dotenv import load_dotenv, find_dotenv
load_dotenv(find_dotenv()) # read local .env file

Step 1: Ingestion

As discussed in the previous article, below are the steps in the ingestion process:

  1. Load / Extract the text
  2. Split the text into chunks
  3. Pass those chunks to Embedding model to Create Embeddings
  4. Store those Embeddings in vector store
(Image: the data ingestion process)

1. Load / Extract the text

The first step is to load the text data from the files using the DirectoryLoader from langchain_community. It's a straightforward way to extract the text content from the documents present in a directory for further processing. The load() method reads the documents, and documents[0].page_content[:100] displays the first 100 characters of the text, giving a peek into the loaded data.

from langchain_community.document_loaders.text import TextLoader
from langchain_community.document_loaders.directory import DirectoryLoader

loader = DirectoryLoader('../data', glob="./*.txt", loader_cls=TextLoader)  # load every .txt file in the data folder
documents = loader.load()
documents[0].page_content[:100]  # peek at the first 100 characters

2. Split the Data into small chunks:

This segment breaks the text into smaller pieces, or chunks, using RecursiveCharacterTextSplitter. This is helpful for processing large documents in manageable parts. The chunk_size parameter defines the maximum size of each chunk, while chunk_overlap allows for some overlap between consecutive chunks to ensure continuity of context. len(docs) shows the total number of chunks created. This is one of the most popular ways to create chunks; we will discuss more approaches in subsequent articles.

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=10)
docs = text_splitter.split_documents(documents)
len(docs) # Out: 63
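If you want to sanity-check the splitter's output, here is a quick optional sketch (my own addition, not part of the original notebook) that prints the first few chunks along with the file each one came from. Consecutive chunks from the same file may share up to chunk_overlap characters.

# Peek at a few chunks and the file each one came from
for chunk in docs[:3]:
    print(chunk.metadata["source"])   # path of the source .txt file
    print(repr(chunk.page_content))   # the chunk text itself
    print()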

3 & 4. Create Embeddings and Store them in Vector Database:

Next, we convert the split documents into embeddings using the embedding model “text-embedding-3-small” from OpenAI and store these embeddings in a vector store (Chroma).

Chroma is a database for building AI applications with embeddings. It comes with everything you need to get started built in, and runs on your machine.

Embeddings are vector representations of text, useful for various NLP tasks. This process is essential for creating a searchable database of text chunks based on their semantic content.
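As a quick aside, if you are curious what an embedding actually looks like, the small sketch below (an optional check of my own, not part of the pipeline) embeds a sample query and prints the vector length and its first few values. For text-embedding-3-small the default vector length is 1536.

# Optional: peek at a single embedding vector
from langchain_openai import OpenAIEmbeddings

sample_embedder = OpenAIEmbeddings(model="text-embedding-3-small")
vector = sample_embedder.embed_query("What loans do you offer?")  # example query
print(len(vector))   # 1536 dimensions for text-embedding-3-small
print(vector[:5])    # first few values of the vector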

from langchain_community.vectorstores.chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(documents=docs, embedding=embeddings)

Just the last line in the above code block converts ‘docs’ into embeddings using the embedding model and stores them in the vector database. There are a lot of such abstractions in LangChain.

Everything up to this point is a one-time process. You can now make as many queries to this vector store as needed for any downstream tasks.
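Since ingestion is a one-time step, it is often worth persisting the index to disk so you do not re-embed the documents on every run. Below is a minimal sketch; the directory name "chroma_db" is just an example, and depending on your Chroma version you may also need to call vectorstore.persist().

# Write the index to disk while ingesting (directory name is an example)
vectorstore = Chroma.from_documents(
    documents=docs,
    embedding=embeddings,
    persist_directory="chroma_db",
)

# Later (for example in another process), reload the same index without re-ingesting
vectorstore = Chroma(
    persist_directory="chroma_db",
    embedding_function=embeddings,
)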

Retrieval:

Now that we have the data and the database ready, we can start building the retrieval and generation part. Whenever the user asks a question, we will do the following (a minimal sketch follows the list):

  1. Convert the question into an embedding using the same embedding model as above
  2. Find the approximate nearest neighbours of the query embedding in the database
  3. Feed the corresponding text chunks to the LLM along with the original question
  4. The LLM generates an answer to the question
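Here is a minimal sketch of steps 1 and 2 using the vector store directly, before we wrap everything into a chain. The query string and the k value are just examples; the question is embedded under the hood with the same model we used at ingestion time.

# Fetch the chunks closest to the query embedding
results = vectorstore.similarity_search("What loan do you offer?", k=4)
for doc in results:
    print(doc.metadata["source"], "->", doc.page_content[:80])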

Let's take a look at this in action.

Initialise a Retriever

To be able to fetch the relevant documents, we initialise a retriever from the previously created vectorstore. This retriever is responsible for fetching relevant document chunks based on a given query.

The as_retriever() function wraps the vector store in a retriever interface that, given a query, looks up and returns the most relevant document chunks.

The output from the retriever is then formatted so that we can pass it to the LLM for generation.

retriever = vectorstore.as_retriever()  # initialise a retriever on top of the vector store

def format_docs(docs):
    # join the retrieved chunks into a single context string
    return "\n\n".join(doc.page_content for doc in docs)

retrieval_chain = retriever | format_docs  # format the docs returned by the retriever
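It is often useful to inspect exactly what the LLM will receive as context. The retrieval_chain above is itself a runnable, so you can invoke it directly; you can also pass search_kwargs to as_retriever() to control how many chunks come back. The query and the k value below are just examples.

# Peek at the formatted context that will be injected into the prompt
print(retrieval_chain.invoke("What loan do you offer?"))

# Optionally retrieve a different number of chunks (k=2 is an arbitrary example)
retriever_top2 = vectorstore.as_retriever(search_kwargs={"k": 2})
print((retriever_top2 | format_docs).invoke("What loan do you offer?"))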

Generation

Initialise the Large Language Model:

from langchain_openai import ChatOpenAI

# Note: ChatOpenAI defaults to gpt-3.5-turbo at the time of writing; pass
# model=os.environ['OPENAI_MODEL_NAME'] explicitly if you want the model configured earlier.
llm = ChatOpenAI(temperature=0)

Define the main chain for RAG:

The RAG chain is defined here, integrating the above retriever and document-formatting function with a prompt, the language model, and an output parser. This chain outlines the entire process: retrieving context, formatting it, prompting the LLM with this context and a question, and parsing the LLM’s response.

from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

PROMPT = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question} \nContext: {context}
Answer:
"""

rag_chain = (
    {"context": retrieval_chain, "question": RunnablePassthrough()}  # fetch + format context, pass the question through
    | ChatPromptTemplate.from_template(PROMPT)                       # fill the prompt template
    | llm                                                            # generate the answer
    | StrOutputParser()                                              # return the answer as a plain string
)

Your basic RAG pipeline is ready!

Now, if you invoke this chain with a question, you will get an answer.

rag_chain.invoke("What loan do you offer?")
Out: "Elmwood Banking & Trust offers diverse personal loans with flexible structures and competitive interest rates for various life events. They also offer commercial loans tailored to each business's unique needs for growth and expansion. However, there is no specific mention of a particular loan offered by the bank for the question."
rag_chain.invoke("Do you offer vegetarian food?")
Out: 'Yes, La Bella Vita prides itself on its vegetarian and vegan options, which are crafted with the same attention to detail and flavor as their meat-based counterparts.'
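Because the chain ends with the LLM followed by a string output parser, it also supports streaming out of the box, which is handy for a chatbot UI. A minimal sketch (the question is just an example):

# Stream the answer token by token instead of waiting for the full response
for token in rag_chain.stream("What loan do you offer?"):
    print(token, end="", flush=True)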

Visualisation of the Retriever

1. Visualise the embedding (vector) space in 3 dimensions.

The embedding model ‘text-embedding-3-small’ used above produces vectors with 1536 dimensions. That is far too many dimensions for a human to visualise. So, just to understand the concept better, we will apply a UMAP transformer to the vector space and reduce it from 1536 dimensions to 3.

import umap
import numpy as np
from tqdm import tqdm

doc_strings = [doc.page_content for doc in docs]
vectors = embeddings.embed_documents(doc_strings)  # raw 1536-dimensional vectors for all chunks

# umap_transformer = umap.UMAP(random_state=0, transform_seed=0).fit(vectors)  # for 2 dimensions
umap_transformer = umap.UMAP(random_state=0, transform_seed=0, n_components=3).fit(vectors)  # for 3 dimensions

def umap_embed(vectors, umap_transformer):
    # project each vector down to the fitted low-dimensional space
    umap_embeddings = np.array([umap_transformer.transform([vector])[0] for vector in tqdm(vectors)])
    return umap_embeddings

global_embeddings = umap_embed(vectors, umap_transformer)
global_embeddings
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(global_embeddings[:, 0], global_embeddings[:, 1], global_embeddings[:, 2], s=10)
ax.set_title('Embeddings')
# plt.axis('off')

In the above plot, we can see two distinct groups of vectors. These two groups arise from the fact that the chunks come from different files containing completely different texts. This is a really nice way to visualise your vector space. However, keep in mind that you lose most of the detail when transforming it down to 3 dimensions.
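To confirm that the two clusters really do correspond to the two source files, you can colour each point by the file its chunk came from. This is a small sketch of my own, reusing the docs and global_embeddings computed above.

# Colour each 3D point by the source file of its chunk
import matplotlib.pyplot as plt

sources = [doc.metadata["source"] for doc in docs]
unique_sources = sorted(set(sources))
colors = [unique_sources.index(s) for s in sources]

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(global_embeddings[:, 0], global_embeddings[:, 1], global_embeddings[:, 2],
           c=colors, cmap='coolwarm', s=10)
ax.set_title('Embeddings coloured by source file')
plt.show()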

Next, let's try to visualise the retrieval process. The following function calc_global_embeddings embeds the query, fetches the relevant documents, and then marks both of them on the plot above.

def calc_global_embeddings(query, embeddings, retriever, umap_transformer, embed_function, global_embeddings):
    # embed the query with the same embedding model
    q_embedding = embeddings.embed_query(query)

    # retrieve the relevant chunks and embed their text
    docs = retriever.get_relevant_documents(query)
    page_contents = [doc.page_content for doc in docs]
    vectors_content_vectors = embeddings.embed_documents(page_contents)

    # project the query and the retrieved chunks into the same 3D UMAP space
    query_embeddings = embed_function([q_embedding], umap_transformer)
    retrieved_embeddings = embed_function(vectors_content_vectors, umap_transformer)

    fig = plt.figure()
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(global_embeddings[:, 0], global_embeddings[:, 1], global_embeddings[:, 2], s=10, color='gray')
    ax.scatter(query_embeddings[:, 0], query_embeddings[:, 1], query_embeddings[:, 2], s=150, marker='X', color='r')
    ax.scatter(retrieved_embeddings[:, 0], retrieved_embeddings[:, 1], retrieved_embeddings[:, 2], s=50, facecolors='none', edgecolors='g')
    ax.set_title(f'{query}')
    # plt.axis('off')
    plt.show()

calc_global_embeddings("What loan do you offer?", embeddings, retriever, umap_transformer, umap_embed, global_embeddings)

If you ask a completely different question, one based on the other document, you will see that the query and the relevant chunks map to a different part of the vector space.

calc_global_embeddings("Do you offer vegetarian food?", embeddings, retriever, umap_transformer, umap_embed,
global_embeddings)

Conclusion

We built a simple RAG application using LangChain and Python. We also used UMAP to project the embeddings into 3 dimensions so we could visualise the query and how it pulls in the relevant documents used to generate the answer.

Here you can find the entire code as a Jupyter notebook to play around with.

References:

The retriever visualisation in this blog was taken from this video.
