RAG chatbot powered by Langchain, OpenAI, Google Generative AI and Hugging Face APIs

Ala Eddine GRINE
Published in The Deep Hub · 15 min read · Feb 13, 2024
LLM providers | Image by Author

Introduction

Although Large Language Models (LLMs) are powerful and capable of generating creative content, they can produce outdated or incorrect information as they are trained on static data. To overcome this limitation, Retrieval Augmented Generation (RAG) systems can be used to connect the LLM to external data and obtain more reliable answers.

Here is an example: in a recent paper, researchers introduced “Diffuse to Choose”, a novel diffusion-based, image-conditioned inpainting model.

I asked ChatGPT what “Diffuse to Choose” means. Here is the answer.

“Diffuse to Choose” is not a commonly used phrase or term. Without further context, it is difficult to determine its exact meaning.

The aim of this project is to build a RAG chatbot in Langchain powered by OpenAI, Google Generative AI, and Hugging Face APIs. Documents in txt, pdf, CSV, or docx format can be uploaded and relevant documents are retrieved and sent to the LLM along with follow-up questions for accurate answers.

This article delves into each component of the RAG system, from the document loader to the conversational retrieval chain. Additionally, a user interface is built with Streamlit.

Conversational RAG Architecture

Here is an illustration of the architecture and the workflow of the RAG chatbot that we will be building using Langchain.

Retrieval Augmented Generation (RAG) system | Image created by the Author using Lucid

There are two main blocks in the RAG architecture:

  • The first block includes a document loader, text splitter, vector store, and retriever. It loads external documents, converts them into numerical representations (embeddings), and stores them in a vector store, such as a Chroma vector database.
  • The second block comprises LLMs, memory, and prompt templates. It interfaces with the retriever to retrieve documents similar to the query, augments the LLM prompt with these documents, and communicates with the LLM to obtain an accurate answer.

Here are the main steps in the RAG workflow:

  • (1), (2), (3) and (4): The standalone question prompt is formatted with the follow-up question and the chat history and passed to the LLM, which rephrases the follow-up question to get a standalone question.
    For example, suppose the chat history consists of the user’s question “What does DTC stand for?” and the AI’s answer “DTC stands for Diffuse to Choose”. The follow-up question is “Please provide more details about it, including its use cases and implementation”. The LLM rephrases this question and replaces “it” with “DTC” to obtain the following standalone question: “What are the use cases and implementation of Diffuse to Choose (DTC)?”
  • (5), (6), (7) and (8): The next step is to search for relevant information. The retriever compares the embeddings of the standalone question with the vectors stored in the Chroma vectorstore and retrieves the relevant documents.
  • (9) and (10): The LLM prompt is augmented with retrieved documents (and chat history) and passed to the LLM in order to obtain a reliable answer.
  • (11): The memory is updated with the follow-up question and the AI’s answer.

We will dive deeper into each component and each step in the following sections.

Conversational RAG Implementation

Let’s start with Retrieval. It includes document loaders, text splitting into chunks, vector stores and embeddings, and finally, retrievers.

Document loaders

LangChain offers more than 80 document loaders to simplify the process of loading data from various sources. These sources include the web, cloud services like AWS S3, local files (such as CSV and JSON), git, emails, and more.

To retrieve files from a temporary directory (TMP_DIR) for our application, we will use the DirectoryLoader. This loader can handle files in txt, pdf, CSV, or docx format. We select each format using the glob parameter. The loader_cls parameter defines the loader class for each format. For example, TextLoader is used for txt files.

Below is a code snippet for loading documents.

from langchain_community.document_loaders import (
    DirectoryLoader,
    TextLoader,
    PyPDFLoader,
    CSVLoader,
    Docx2txtLoader,
)

def langchain_document_loader(TMP_DIR):
    """
    Load documents from the temporary directory (TMP_DIR).
    Files can be in txt, pdf, CSV or docx format.
    """
    documents = []

    txt_loader = DirectoryLoader(
        TMP_DIR.as_posix(), glob="**/*.txt", loader_cls=TextLoader, show_progress=True
    )
    documents.extend(txt_loader.load())

    pdf_loader = DirectoryLoader(
        TMP_DIR.as_posix(), glob="**/*.pdf", loader_cls=PyPDFLoader, show_progress=True
    )
    documents.extend(pdf_loader.load())

    csv_loader = DirectoryLoader(
        TMP_DIR.as_posix(), glob="**/*.csv", loader_cls=CSVLoader, show_progress=True,
        loader_kwargs={"encoding": "utf8"}
    )
    documents.extend(csv_loader.load())

    doc_loader = DirectoryLoader(
        TMP_DIR.as_posix(),
        glob="**/*.docx",
        loader_cls=Docx2txtLoader,
        show_progress=True,
    )
    documents.extend(doc_loader.load())
    return documents

Text Splitters

The text splitter divides documents into smaller sections that fit within the model’s context window. In Langchain, we can split by token or by character, and even split source code in languages such as Java, JavaScript, and PHP.
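
As an aside (not part of the chatbot pipeline), here is a minimal sketch of the language-aware splitter; the Language enum and the from_language constructor come from langchain.text_splitter:

from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

js_code = """
function greet(name) {
  console.log(`Hello, ${name}`);
}
greet("world");
"""

# Build a splitter that respects JavaScript syntax boundaries
# (Language also covers PYTHON, JAVA, PHP, and more).
js_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.JS,
    chunk_size=60,
    chunk_overlap=0,
)
js_chunks = js_splitter.create_documents([js_code])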

For generic text it is recommended to use the RecursiveCharacterTextSplitter, as it preserves the semantic relationship between paragraphs, sentences and words by keeping them together as much as possible.

To preserve context across boundaries, we use a small overlap between consecutive chunks, so that the same context appears at the end of one chunk and the start of the next.

from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=1600,
    chunk_overlap=200
)

# Text splitting
chunks = text_splitter.split_documents(documents=documents)

Vectorstores and Embeddings

Now that the documents are divided into small, meaningful chunks, we can retrieve the chunks that are most similar to the query. We will first generate embeddings for these chunks and then store them in a vectorstore.

Text embeddings

Embeddings are numerical representations of text data in a high-dimensional vector space. For instance, OpenAI’s text-embedding-ada-002 model produces embedding vectors of size 1536.

To identify the most similar documents to a query, we can search for vectors with the highest similarity to the query’s embeddings. Cosine similarity is commonly used to measure the similarity between two vectors.
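
To make this concrete, here is a small illustration (not part of the chatbot code) of cosine similarity computed with NumPy on toy vectors; real embedding vectors have hundreds or thousands of dimensions:

import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional vectors; in practice these would come from
# embeddings.embed_query(...) and embeddings.embed_documents(...).
query_vector = [0.1, 0.3, 0.5]
chunk_vector = [0.2, 0.1, 0.4]
print(cosine_similarity(query_vector, chunk_vector))  # ≈ 0.92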

OpenAI, Google Generative AI and Hugging Face offer distinct embedding models. In Langchain, we can connect to the embeddings API endpoint by specifying the name of the embedding model. The following models will be used:

Embedding models | Image by Author

from langchain_openai import OpenAIEmbeddings
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.embeddings import HuggingFaceInferenceAPIEmbeddings

def select_embeddings_model(LLM_service="OpenAI"):
    """Connect to the embeddings API endpoint by specifying
    the name of the embedding model."""
    if LLM_service == "OpenAI":
        embeddings = OpenAIEmbeddings(
            model='text-embedding-ada-002',
            api_key=openai_api_key
        )
    if LLM_service == "Google":
        embeddings = GoogleGenerativeAIEmbeddings(
            model="models/embedding-001",
            google_api_key=google_api_key
        )
    if LLM_service == "HuggingFace":
        embeddings = HuggingFaceInferenceAPIEmbeddings(
            api_key=HF_key,
            model_name="thenlper/gte-large"
        )

    return embeddings

Vectorstores

A vectorstore is a database used to store embedding vectors. It allows for searching vectors that are most similar to the query’s embeddings.

There are several open-source options for vector storage. We will use the Chroma vector database.

from langchain_community.vectorstores import Chroma

def create_vectorstore(embeddings, documents, vectorstore_name):
    """Create a Chroma vector database."""
    persist_directory = LOCAL_VECTOR_STORE_DIR.as_posix() + "/" + vectorstore_name
    vector_store = Chroma.from_documents(
        documents=documents,
        embedding=embeddings,
        persist_directory=persist_directory
    )
    return vector_store

Relevant documents are retrieved from the vectorstore by comparing the query’s embeddings to all the vectors in the vectorstore using cosine similarity.
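
As a quick sanity check, the vectorstore can also be queried directly. The snippet below is a usage sketch that assumes the chunks created earlier and a hypothetical store name; Chroma’s similarity_search_with_score returns (document, score) pairs, where the score is a distance (lower means more similar):

vector_store = create_vectorstore(
    embeddings=select_embeddings_model("OpenAI"),
    documents=chunks,
    vectorstore_name="my_vectorstore",  # hypothetical name
)

docs_and_scores = vector_store.similarity_search_with_score(
    "What does Diffuse to Choose do?", k=4
)
for doc, score in docs_and_scores:
    print(round(score, 3), doc.page_content[:80])  # distance and a preview of each chunk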

Retrievers

The retriever returns relevant documents to a query.

We will begin with a basic retriever, the Vectorstore-backed retriever.

Vectorstore-backed retriever

It uses semantic search to retrieve documents from a Vectorstore. It can perform three types of search:

  • Similarity search: it returns the k most similar vectors.
  • Maximum marginal relevance search (MMR): it’s used to ensure both similarity to the query and diversity of the selected documents.
  • Similarity_score_threshold: defines the minimum relevance threshold.

def Vectorstore_backed_retriever(
    vectorstore, search_type="similarity", k=4, score_threshold=None
):
    """Create a vectorstore-backed retriever.
    Parameters:
        search_type: Defines the type of search that the Retriever should perform.
            Can be "similarity" (default), "mmr", or "similarity_score_threshold"
        k: number of documents to return (Default: 4)
        score_threshold: Minimum relevance threshold for similarity_score_threshold (default=None)
    """
    search_kwargs = {}
    if k is not None:
        search_kwargs['k'] = k
    if score_threshold is not None:
        search_kwargs['score_threshold'] = score_threshold

    retriever = vectorstore.as_retriever(
        search_type=search_type,
        search_kwargs=search_kwargs
    )
    return retriever

When using a Vectorstore-backed retriever to retrieve documents, irrelevant information that is not related to the query context is often included. Removing this irrelevant information can lead to more cost-effective and accurate LLM calls.

Contextual Compression Retriever

The Contextual Compression Retriever removes irrelevant information from retrieved documents.
It first passes the query to the base retriever (vectorstore-backed retriever), which returns initial documents.
The initial documents are then passed through a Document Compressor, which shortens them or drops them entirely.

The Document Compressor can make an LLM call to perform contextual compression on each retrieved document. This process can be slow and expensive.
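
For reference, this is roughly what the LLM-based approach looks like with LangChain’s LLMChainExtractor; we will not use it here, precisely because it issues one LLM call per retrieved document:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# `llm` is any chat model and `base_retriever` a vectorstore-backed retriever.
llm_compressor = LLMChainExtractor.from_llm(llm)
llm_compression_retriever = ContextualCompressionRetriever(
    base_compressor=llm_compressor,
    base_retriever=base_retriever,
)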

Instead, we will create a Document Compressor Pipeline as follows:

  1. Transform the initial retrieved documents by splitting them into smaller chunks using CharacterTextSplitter with a small chunk size.
  2. Filter out redundant chunks using EmbeddingsRedundantFilter.
  3. Use EmbeddingsFilter to keep only the chunks most relevant to the query, controlled by the similarity_threshold and k parameters. We will set k to 16.
  4. Reorder the chunks using LongContextReorder, so that the most relevant elements end up at the top and bottom of the list. This improves LLM performance, as explained in the LangChain documentation on long-context reordering.

from langchain.retrievers.document_compressors import DocumentCompressorPipeline
from langchain.text_splitter import CharacterTextSplitter
from langchain_community.document_transformers import EmbeddingsRedundantFilter, LongContextReorder
from langchain.retrievers.document_compressors import EmbeddingsFilter
from langchain.retrievers import ContextualCompressionRetriever

def create_compression_retriever(embeddings, base_retriever, chunk_size=500, k=16, similarity_threshold=None):
    """Build a ContextualCompressionRetriever.
    We wrap the base_retriever (a vectorstore-backed retriever) in a ContextualCompressionRetriever.
    The compressor here is a Document Compressor Pipeline, which splits documents
    into smaller chunks, removes redundant chunks, keeps the chunks most relevant to the query,
    and reorders them so that the most relevant are at the top and bottom of the list.

    Parameters:
        embeddings: OpenAIEmbeddings, GoogleGenerativeAIEmbeddings or HuggingFaceInferenceAPIEmbeddings.
        base_retriever: a vectorstore-backed retriever.
        chunk_size (int): documents are split into smaller chunks using a CharacterTextSplitter with a default chunk_size of 500.
        k (int): the top k chunks most relevant to the query are kept by the EmbeddingsFilter. Default = 16.
        similarity_threshold: minimum relevance threshold used by the EmbeddingsFilter. Default = None.
    """

    # 1. Split documents into smaller chunks
    splitter = CharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=0, separator=". ")

    # 2. Remove redundant documents
    redundant_filter = EmbeddingsRedundantFilter(embeddings=embeddings)

    # 3. Filter based on relevance to the query (similarity_threshold and top k)
    relevant_filter = EmbeddingsFilter(embeddings=embeddings, k=k, similarity_threshold=similarity_threshold)

    # 4. Reorder the documents: less relevant documents are placed in the middle of the list,
    #    more relevant ones at the beginning and end.
    # Reference: https://python.langchain.com/docs/modules/data_connection/retrievers/long_context_reorder
    reordering = LongContextReorder()

    # 5. Create the compressor pipeline and retriever
    pipeline_compressor = DocumentCompressorPipeline(
        transformers=[splitter, redundant_filter, relevant_filter, reordering]
    )
    compression_retriever = ContextualCompressionRetriever(
        base_compressor=pipeline_compressor,
        base_retriever=base_retriever
    )

    return compression_retriever

Cohere Reranker

We will use the Cohere rerank endpoint to re-order the results based on their semantic relevance to the query.
We will wrap our base retriever with a ContextualCompressionRetriever. The compressor here is the Cohere Reranker.

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CohereRerank
from langchain_community.llms import Cohere

def CohereRerank_retriever(
    base_retriever,
    cohere_api_key, cohere_model="rerank-multilingual-v2.0", top_n=8
):
    """Build a ContextualCompressionRetriever using the Cohere Rerank endpoint to reorder the results based on relevance.
    Parameters:
        base_retriever: a vectorstore-backed retriever
        cohere_api_key: the Cohere API key
        cohere_model: the Cohere model, either 'rerank-english-v2.0' or 'rerank-multilingual-v2.0' (default).
        top_n: top n results returned by Cohere Rerank. Default = 8.
    """
    compressor = CohereRerank(
        cohere_api_key=cohere_api_key,
        model=cohere_model,
        top_n=top_n
    )

    retriever_Cohere = ContextualCompressionRetriever(
        base_compressor=compressor,
        base_retriever=base_retriever
    )
    return retriever_Cohere

Conversational retrieval Chain with memory

After retrieving the most relevant documents, the next step is to add them to the LLM prompt, which will be sent to the LLM.

The main components to be used here are: ChatModel, PromptTemplate, Memory and ConversationalRetrievalChain.

ChatModel

The ChatModel interacts with LLMs, such as GPT-3.5-turbo and Google’s gemini-pro.

We will use the OpenAI, Google and Hugging Face APIs, and leverage the ChatOpenAI, ChatGoogleGenerativeAI and HuggingFaceHub Langchain classes to instantiate the following pre-trained models:

Pre-trained LLMs | Image by Author

While GPT-4 is rumored to have around 1.8 trillion parameters, Mistral-7B-Instruct-v0.2, an open-source model available on Hugging Face, has 7 billion. Despite its smaller size, Mistral still delivers strong performance compared to larger models.

The chat model can be instantiated as follows:

from langchain_openai import ChatOpenAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.llms import HuggingFaceHub

def instantiate_LLM(LLM_provider, api_key, temperature=0.5, top_p=0.95, model_name=None):
    """Instantiate an LLM in Langchain.
    Parameters:
        LLM_provider (str): the LLM provider; in ["OpenAI", "Google", "HuggingFace"]
        model_name (str): in ["gpt-3.5-turbo", "gpt-3.5-turbo-0125", "gpt-4-turbo-preview",
            "gemini-pro", "mistralai/Mistral-7B-Instruct-v0.2"].
        api_key (str): google_api_key, openai_api_key or huggingfacehub_api_token
        temperature (float): range 0.0 - 1.0; default = 0.5
        top_p (float): range 0.0 - 1.0; default = 0.95
    """
    if LLM_provider == "OpenAI":
        llm = ChatOpenAI(
            api_key=api_key,
            model=model_name,  # in ["gpt-3.5-turbo", "gpt-3.5-turbo-0125", "gpt-4-turbo-preview"]
            temperature=temperature,
            model_kwargs={"top_p": top_p}
        )
    if LLM_provider == "Google":
        llm = ChatGoogleGenerativeAI(
            google_api_key=api_key,
            model=model_name,  # "gemini-pro"
            temperature=temperature,
            top_p=top_p,
            convert_system_message_to_human=True
        )
    if LLM_provider == "HuggingFace":
        llm = HuggingFaceHub(
            repo_id=model_name,  # "mistralai/Mistral-7B-Instruct-v0.2"
            huggingfacehub_api_token=api_key,
            model_kwargs={
                "temperature": temperature,
                "top_p": top_p,
                "do_sample": True,
                "max_new_tokens": 1024
            },
        )
    return llm

The main parameters that we can adjust are:

  • temperature: controls the degree of randomness in token selection. Higher values increase diversity, and hence, creativity.
  • top_p: The cumulative probability cutoff for token selection. Higher values increase diversity.
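
As a usage sketch, here is how the instantiate_LLM helper defined above might be called: a near-deterministic model for question rephrasing and a more creative one for answering (the exact values are illustrative):

# openai_api_key is loaded in the next snippet.
rephrasing_llm = instantiate_LLM(
    LLM_provider="OpenAI", api_key=openai_api_key,
    temperature=0.1, top_p=0.95, model_name="gpt-3.5-turbo",
)
answering_llm = instantiate_LLM(
    LLM_provider="OpenAI", api_key=openai_api_key,
    temperature=0.7, top_p=0.95, model_name="gpt-3.5-turbo",
)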

We also need four API keys.

First, we need to add the environment variables: OPENAI_API_KEY, GOOGLE_API_KEY, HUGGINGFACEHUB_API_TOKEN and COHERE_API_KEY.

Next, we load them as follows.

import os
from getpass import getpass

def get_environment_variable(key):
    if key in os.environ:
        value = os.environ.get(key)
        print(f"\n[INFO]: {key} retrieved successfully.")
    else:
        print(f"\n[ERROR]: {key} is not found in your environment variables.")
        value = getpass(f"Insert your {key}")
    return value

openai_api_key = get_environment_variable("OPENAI_API_KEY")
google_api_key = get_environment_variable("GOOGLE_API_KEY")
HF_key = get_environment_variable("HUGGINGFACEHUB_API_TOKEN")
cohere_api_key = get_environment_variable("COHERE_API_KEY")

Memory

The memory enables the storage of chat history. This can range from a simple buffer, to storing the last K interactions or tokens, or summarising the conversation or part of it.

By default, we will use a simple ConversationBufferMemory.

For LLMs with a small context window, such as gpt-3.5-turbo, we will use a conversation summary buffer. This buffer keeps recent interactions up to a token limit and summarizes older messages.

from langchain.memory import ConversationSummaryBufferMemory, ConversationBufferMemory

def create_memory(model_name='gpt-3.5-turbo', memory_max_token=None):
    """Creates a ConversationSummaryBufferMemory for gpt-3.5-turbo.
    Creates a ConversationBufferMemory for the other models."""

    if model_name == "gpt-3.5-turbo":
        if memory_max_token is None:
            memory_max_token = 1024  # max_tokens for 'gpt-3.5-turbo' = 4096
        memory = ConversationSummaryBufferMemory(
            max_token_limit=memory_max_token,
            llm=ChatOpenAI(model_name="gpt-3.5-turbo", openai_api_key=openai_api_key, temperature=0.1),
            return_messages=True,
            memory_key='chat_history',
            output_key="answer",
            input_key="question"
        )
    else:
        memory = ConversationBufferMemory(
            return_messages=True,
            memory_key='chat_history',
            output_key="answer",
            input_key="question",
        )
    return memory

  • return_messages is set to True so that the chat history is returned as a list of messages (by default, it is concatenated into a single string).
  • output_key is set to 'answer' and input_key to 'question', so the memory tracks and saves user questions and AI answers.

memory.save_context(inputs={"question": "..."}, outputs={"answer": "...."})

Prompt templates

Prompt templates generate prompts for the LLM. For our RAG chatbot, we need two templates.

  • The first template asks the LLM to generate a standalone question given the chat history and a follow-up question. The PromptTemplate uses this template to return a string prompt when invoked.

from langchain.prompts import PromptTemplate

standalone_question_template = """Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original language.\n\n
Chat History:\n{chat_history}\n
Follow Up Input: {question}\n
Standalone question:"""

standalone_question_prompt = PromptTemplate(
    input_variables=['chat_history', 'question'],
    template=standalone_question_template
)
  • The second template includes a placeholder for the context (i.e. retrieved documents), chat history, and the user’s question. It instructs the LLM to answer the question based solely on the provided context. The ChatPromptTemplate uses this template to return a list of chat messages.

def answer_template(language="english"):
    """Pass the standalone question along with the chat history and context
    to the LLM, which will answer."""

    template = f"""Answer the question at the end, using only the following context (delimited by <context></context>).
Your answer must be in the language at the end.

<context>
{{chat_history}}

{{context}}
</context>

Question: {{question}}

Language: {language}.
"""
    return template

ConversationalRetrievalChain

All the components we have built so far, including the Retriever, ChatModel or LLM, Memory and Prompts, come in handy here as they are passed as parameters to the ConversationalRetrievalChain.

The ConversationalRetrievalChain is a built-in chain that "chains" our components together, allowing us to chat with our documents. First, it passes the follow-up question and chat history to the LLM, which rephrases the question and creates a standalone query. The Retriever then retrieves relevant documents (context) based on this query. These documents, along with the standalone question and chat history, are then passed to the LLM for answering.

Here is an example using the Gemini API.

from langchain.chains import ConversationalRetrievalChain
from langchain.prompts import ChatPromptTemplate

# Build the answer prompt from the template defined above.
answer_prompt = ChatPromptTemplate.from_template(answer_template())

chain = ConversationalRetrievalChain.from_llm(
    condense_question_prompt=standalone_question_prompt,
    combine_docs_chain_kwargs={'prompt': answer_prompt},
    condense_question_llm=instantiate_LLM(
        LLM_provider="Google", api_key=google_api_key, temperature=0.1,
        model_name="gemini-pro"),
    memory=create_memory("gemini-pro"),
    retriever=retriever,
    llm=instantiate_LLM(
        LLM_provider="Google", api_key=google_api_key, temperature=0.5,
        model_name="gemini-pro"),
    chain_type="stuff",
    verbose=False,
    return_source_documents=True
)

The temperature of the condense_question_llm is set to 0.1, making the LLM more deterministic.
To generate answers, we use an LLM with a higher temperature, which allows for more creativity.

A step-by-step approach to the Conversational Retrieval Chain

In the previous section, we leveraged the built-in ConversationalRetrievalChain to create our RAG model.

This section will use a step-by-step approach to help us understand what’s happening under the hood and customize the RAG chain. For instance, we can return the standalone question and format retrieved documents.

First step: Create a standalone_question chain

Here, we load the chat history and pass it along with the follow-up question to the LLM. The LLM combines them and generates a standalone question (new query).

from operator import itemgetter

from langchain_core.messages import get_buffer_string
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough

# 1. Load the memory using RunnableLambda and retrieve the chat_history attribute with itemgetter.
#    `RunnablePassthrough.assign` adds chat_history to the chain's input.

loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("chat_history"),
)

# 2. Pass the follow-up question along with the chat history to the LLM, and parse the answer (standalone_question).

condense_question_prompt = PromptTemplate(
    input_variables=['chat_history', 'question'],
    template=standalone_question_template
)

condense_question_llm = instantiate_LLM(
    LLM_provider="Google", api_key=google_api_key, temperature=0.1,
    model_name="gemini-pro"
)

standalone_question_chain = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | condense_question_prompt
    | condense_question_llm
    | StrOutputParser(),
}

# 3. Combine loaded_memory and standalone_question_chain

chain_question = loaded_memory | standalone_question_chain

The | symbol enables the chaining of components, where the output of one component serves as the input for the next.
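
As a toy illustration of this pipe syntax (unrelated to our documents), a prompt template, a chat model, and an output parser can be chained directly:

from langchain.prompts import PromptTemplate
from langchain_core.output_parsers import StrOutputParser

toy_chain = (
    PromptTemplate.from_template("Give a one-line definition of {term}.")
    | condense_question_llm  # any chat model works here
    | StrOutputParser()
)
print(toy_chain.invoke({"term": "inpainting"}))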

This is an example of how to invoke chain_question:

memory.clear()
memory.save_context(
    {"question": "What does DTC stand for?"},
    {"answer": "Diffuse to Choose."}
)
print("Chat history:\n", memory.load_memory_variables({}))

follow_up_question = "Please give more details about it, including its use cases and implementation."
print("\nFollow-up question:\n", follow_up_question)

# Invoke chain_question
response = chain_question.invoke({"question": follow_up_question})["standalone_question"]
print("\nStandalone_question:\n", response)

Here is the response:

Chat history:
{'chat_history':
[HumanMessage(content='What does DTC stand for?'),
AIMessage(content='Diffuse to Choose.')
]
}

Follow-up question:
Please give more details about it, including its use cases
and implementation.

Standalone_question:
What are the use cases and implementation of Diffuse to Choose (DTC)?

We can see that “it” has been replaced by “Diffuse to Choose (DTC)”, to which it refers.

Second step: Retrieve documents, pass them to the LLM along with the standalone question and chat history, and parse the response.

from langchain.schema import Document, format_document
from langchain.prompts import ChatPromptTemplate

def _combine_documents(docs, document_prompt, document_separator="\n\n"):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

# 1. Retrieve relevant documents

retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "question": lambda x: x["standalone_question"],
}

# 2. Get the variables ['chat_history', 'context', 'question'] that will be passed to `answer_prompt`

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")
answer_prompt = ChatPromptTemplate.from_template(answer_template())  # expects 3 variables: ['chat_history', 'context', 'question']

answer_prompt_variables = {
    "context": lambda x: _combine_documents(docs=x["docs"], document_prompt=DEFAULT_DOCUMENT_PROMPT),
    "question": itemgetter("question"),
    "chat_history": itemgetter("chat_history")  # get chat_history from the `loaded_memory` variable
}

llm = instantiate_LLM(
    LLM_provider="Google", api_key=google_api_key, temperature=0.5,
    model_name="gemini-pro"
)

# 3. Load the memory, format `answer_prompt` with the variables (context, question and chat_history)
#    and pass `answer_prompt` to the LLM. Return the answer, docs and standalone_question.

chain_answer = {
    "answer": loaded_memory | answer_prompt_variables | answer_prompt | llm,
    "docs": lambda x: [
        Document(page_content=doc.page_content, metadata=doc.metadata)  # keep only page_content and metadata
        for doc in x["docs"]
    ],
    "standalone_question": lambda x: x["question"]  # return the standalone_question
}

To create the conversational retrieval chain, let’s chain the two chains: chain_question and chain_answer.

conversational_retriever_chain = chain_question | retrieved_documents | chain_answer

Let’s invoke this chain:

follow_up_question = "Please give more details about it, including its use cases and implementation."

response = conversational_retriever_chain.invoke({"question": follow_up_question})
Markdown(response['answer'].content)

Here is the answer:

Diffuse to Choose (DTC) is a novel diffusion-based image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulations in the given scene content. It is used for Virtual Try-All (Vit-All), which allows users to virtually visualize products in their settings…

If we check the memory, we will see that it does not update automatically. We need to manually save the follow-up question and the AI response as follows:

memory.save_context( 
{"question": follow_up_question},
{"answer": response['answer'].content}
)

We have successfully created our RAG system.
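
To recap, here is a minimal end-to-end sketch that wires the helpers defined in this article into the built-in ConversationalRetrievalChain; the vectorstore name is hypothetical, and the actual application wires things slightly differently (per-provider settings, Cohere reranking, and so on):

documents = langchain_document_loader(TMP_DIR)
chunks = text_splitter.split_documents(documents=documents)

embeddings = select_embeddings_model(LLM_service="OpenAI")
vector_store = create_vectorstore(embeddings, chunks, "my_vectorstore")  # hypothetical name

base_retriever = Vectorstore_backed_retriever(vector_store, search_type="similarity", k=4)
retriever = create_compression_retriever(
    embeddings=embeddings, base_retriever=base_retriever, chunk_size=500, k=16
)

condense_llm = instantiate_LLM("OpenAI", api_key=openai_api_key,
                               temperature=0.1, model_name="gpt-3.5-turbo")
answer_llm = instantiate_LLM("OpenAI", api_key=openai_api_key,
                             temperature=0.5, model_name="gpt-3.5-turbo")

chain = ConversationalRetrievalChain.from_llm(
    condense_question_prompt=standalone_question_prompt,
    combine_docs_chain_kwargs={"prompt": answer_prompt},
    condense_question_llm=condense_llm,
    memory=create_memory("gpt-3.5-turbo"),
    retriever=retriever,
    llm=answer_llm,
    chain_type="stuff",
    return_source_documents=True,
)

response = chain.invoke({"question": "What does DTC stand for?"})
print(response["answer"])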

Streamlit application

We leveraged our RAG system to build a Streamlit application that enables document chat. You can find the application in this GitHub repository.

Here is a screenshot of the app:

Screenshot of the author’s Streamlit application | Image by Author

In the sidebar, you can select the LLM provider, choose an LLM, adjust its parameters, and insert your API keys.

In the main panel, you can create or load a Chroma vector store and display or clear the chat messages.
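
For illustration, here is a heavily simplified sketch of such a sidebar and chat panel; the widget names and layout are assumptions, and the full application in the GitHub repository is more complete:

import streamlit as st

with st.sidebar:
    llm_provider = st.selectbox("LLM provider", ["OpenAI", "Google", "HuggingFace"])
    model_name = st.selectbox(
        "Model", ["gpt-3.5-turbo", "gemini-pro", "mistralai/Mistral-7B-Instruct-v0.2"]
    )
    temperature = st.slider("temperature", 0.0, 1.0, 0.5)
    top_p = st.slider("top_p", 0.0, 1.0, 0.95)
    api_key = st.text_input("API key", type="password")

question = st.chat_input("Ask a question about your documents")
if question:
    st.chat_message("user").write(question)
    # response = chain.invoke({"question": question})
    # st.chat_message("assistant").write(response["answer"])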

Conclusion

In this article, we covered all the steps of creating a RAG chatbot in Langchain, from loading documents to creating a conversational retrieval chain. Additionally, we developed a Streamlit application.

Our RAG chatbot was powered by OpenAI, Google Generative AI and Hugging Face APIs. An alternative option is to run open-source quantized models locally to protect privacy and avoid inference fees. These models can be run with frameworks such as llama.cpp and Ollama.
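
For example, a locally served model can be swapped in with a few lines; the sketch below assumes Ollama is installed and a model (here mistral) has been pulled:

from langchain_community.chat_models import ChatOllama

# A local chat model served by Ollama (no API key or inference fee).
local_llm = ChatOllama(model="mistral", temperature=0.5)

# local_llm can then replace the API-backed LLM passed to
# ConversationalRetrievalChain.from_llm(...).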

You can find the code used in this article in this GitHub Repo:
