(Part 2) Build a Conversational RAG with Mistral-7B and LangChain

Madhav Thaker
11 min read · Jan 3, 2024


DALL-E generated image of a young man having a conversation with a fantasy football assistant

Before diving into the advanced aspects of building Retrieval-Augmented Generation (RAG) applications with LangChain, it is crucial to first explore the foundational groundwork laid out in Part 1 of this series. If you haven’t read it yet, I strongly recommend starting there.

Part 1 provides an essential overview of the limitations in existing Large Language Models (LLMs) and how RAGs can effectively address these challenges. It also guides you through the process of constructing a basic RAG application, which is a vital step in grasping the core concepts of RAG technology.

While Part 1 was instrumental in building your initial understanding, it’s important to recognize that it was just a preliminary step. Real-world applications of RAG technology demand a more nuanced approach. In this article, we will explore the complexities that come with refining a RAG application to provide a smooth, conversational user experience.

Let’s first walk through an example of how things fall short, starting with a new conversation:

# First question in my chat
rag_chain.invoke("How is Mahomes doing?")
According to the provided context, Patrick Mahomes' performance is causing 
panic among fantasy football owners. The article mentions that he has failed
to reach 12 PPR points in five of his last six games and has been under
10 points in four of those games. As a result, his panic meter grade is 3,
which indicates that he is officially panicked and active seekers of a trade
while he still has value.

This is a great start. Mahomes doesn’t seem to be doing well, so maybe we should look for an alternative for our team. Let’s ask a follow-up question:

rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | llm_chain
)

rag_chain.invoke("Who are some good alternatives to him?")
Based on the information provided, some potential alternatives to Austin Ekeler 
as a flex option could be Ezekiel Elliot, James Conner, and Stefon Diggs.
These players have shown better performance in recent games compared to Ekeler
and may provide more value as a flex option.

Well, that response isn’t relevant at all. The RAG doesn’t know who “him” refers to, so it simply latches onto another player in our vector database and answers the question about him instead. Ideally, we need the RAG application to answer: “Who are good alternatives to Patrick Mahomes?”

This article will dive into how we can do this! We will learn:

  • How to store the conversation history in memory and include it within our prompt.
  • How to transform the input question such that it retrieves the relevant information from our vector database.

Conversational RAG Architecture

Before we go further, it’s important to note that there are higher-level abstractions available, such as LangChain’s ConversationalRetrievalChain, which can simplify a lot of this work for us.

However, having a solid understanding of what’s happening ‘under the hood’ is crucial, which is why this tutorial leverages low-level LangChain components. Working at this level lets you understand why you may or may not be getting the results you expect, and ultimately gives you more control over your RAG application.

Let’s first recap the RAG architecture we discussed in Part 1. Again, if you haven’t already, I suggest you read that first since it walks through how this works in more detail.

High Level RAG Architecture

In the highlighted section, we pass the query in as is, but what we really need to pass is a transformed version that can appropriately query our vector database.

Input query transformation

Here is our updated architecture diagram.

High Level Conversational RAG Architecture

Let’s review the key updates:

  1. We now save the conversation history to memory and leverage it to generate a standalone question.
  2. We add a second LLM which is responsible for generating a standalone question that can appropriately query the vector database. (A compact sketch of this flow follows below.)
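
To make this flow concrete, here is a compact, illustrative sketch of a single conversational turn. The function and callables below are placeholders for exposition only, not the actual LangChain components we build later in this article.

from typing import Callable, List, Tuple

def conversational_rag_turn(
    question: str,
    chat_history: List[Tuple[str, str]],
    condense: Callable[[str, List[Tuple[str, str]]], str],
    retrieve: Callable[[str], List[str]],
    answer: Callable[[str, List[str]], str],
) -> str:
    """One conversational turn, mirroring the architecture diagram above."""
    standalone = condense(question, chat_history)  # 1. rewrite the follow-up as a standalone question
    docs = retrieve(standalone)                    # 2. query the vector database with it
    reply = answer(standalone, docs)               # 3. generate the final response from the retrieved context
    chat_history.append((question, reply))         # 4. store the turn so the next question has context
    return reply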

How do we build it?

Similar to my original article, I’m going to use an article from fantasypros.com and ask the LLM questions that it can only answer if it has access to that data.
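
As a quick refresher, here is a minimal sketch of the Part 1 retriever setup that the rest of this article assumes. The loader, splitter, and embedding model shown here are illustrative assumptions, not necessarily the exact choices from the original article, and the article URL is elided.

from langchain.document_loaders import WebBaseLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

# Load the fantasypros.com article (URL elided), chunk it, embed the chunks,
# and expose a FAISS retriever for the chains below. Loader/splitter/embedding
# choices are assumptions for this sketch.
loader = WebBaseLoader("https://www.fantasypros.com/...")
docs = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = text_splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = FAISS.from_documents(chunks, embeddings)
retriever = vector_db.as_retriever()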

Create both LLM pipelines

We will load the model exactly the same way as we did in the original article (a quick loading sketch follows the list below). We’re using the following model/machine to build this:

  • Model: mistralai/Mistral-7B-Instruct-v0.2
  • Machine: 1 Nvidia L4 GPU
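
For reference, here is a minimal loading sketch. The 4-bit quantization config is an assumption to keep the 7B model comfortably within a single L4’s memory and may differ from the exact setup in Part 1.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "mistralai/Mistral-7B-Instruct-v0.2"

# Assumed 4-bit quantization so the 7B model fits on a single L4 GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
mistral_model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
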
from transformers import pipeline
from langchain.llms import HuggingFacePipeline

standalone_query_generation_pipeline = pipeline(
    model=mistral_model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.0,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)
standalone_query_generation_llm = HuggingFacePipeline(pipeline=standalone_query_generation_pipeline)

response_generation_pipeline = pipeline(
    model=mistral_model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)
response_generation_llm = HuggingFacePipeline(pipeline=response_generation_pipeline)

So what’s the difference between the two? The standalone_query_generation_pipeline uses a temperature of 0.0, versus 0.2 for our response_generation_pipeline. I do this to minimize the chance of hallucination when generating the standalone query, since that directly impacts the application’s ability to retrieve relevant context.

For those unfamiliar with temperature, the excerpt below highlights why 0.0 makes sense for our standalone_query_generation_pipeline. More information on temperature and other LLM parameters can be found here.

Temperature is a close second to prompt engineering when it comes to controlling the output of the Generate model. It determines how creative the model should be.

A Temperature of 0 makes the model deterministic. It limits the model to use the word with the highest probability. You can run it over and over and get the same output. As you increase the Temperature, the limit softens, allowing it to use words with lower and lower probabilities…
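
One practical aside: depending on your transformers version, a temperature of 0.0 may trigger a warning or be rejected unless sampling is explicitly disabled. The deterministic intent can also be stated directly with do_sample=False (greedy decoding), in which case temperature is simply ignored. A hedged variant of the standalone-query pipeline:

# Same pipeline as above, but with greedy decoding made explicit.
standalone_query_generation_pipeline = pipeline(
    model=mistral_model,
    tokenizer=tokenizer,
    task="text-generation",
    do_sample=False,          # greedy decoding: always pick the highest-probability token
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=1000,
)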

Now, in this example, I’ve used the same Mistral model with different temperature settings, but you could (and arguably should) use different LLMs altogether for the two roles. For example, you could leverage (or build) a fine-tuned model that is optimized for the standalone query generation task.

Standalone Question Generation Chain

I decided to use a few-shot prompt engineering approach to help guide the LLM.

7B models are capable, but they’re not perfect, so providing a handful of examples in the prompt is a good idea. Take a look at how we do this:

from langchain.prompts.prompt import PromptTemplate
from langchain_core.prompts.chat import ChatPromptTemplate
_template = """
[INST]
Given the following conversation and a follow up question,
rephrase the follow up question to be a standalone question, in its original language,
that can be used to query a FAISS index. This query will be used to retrieve documents with additional context.

Let me share a couple examples.

If you do not see any chat history, you MUST return the "Follow Up Input" as is:
```
Chat History:
Follow Up Input: How is Lawrence doing?
Standalone Question:
How is Lawrence doing?
```

If this is the second question onwards, you should properly rephrase the question like this:
```
Chat History:
Human: How is Lawrence doing?
AI:
Lawrence is injured and out for the season.
Follow Up Input: What was his injury?
Standalone Question:
What was Lawrence's injury?
```

Now, with those examples, here is the actual chat history and input question.
Chat History:
{chat_history}
Follow Up Input: {question}
Standalone question:
[your response here]
[/INST]
"""

STANDALONE_QUESTION_PROMPT = PromptTemplate.from_template(_template)

Now that we have a prompt template, let’s create a chain to populate the prompt with the necessary pieces:

from operator import itemgetter

from langchain.memory import ConversationBufferMemory
from langchain.schema import format_document
from langchain_core.messages import AIMessage, HumanMessage, get_buffer_string
from langchain_core.runnables import RunnableLambda, RunnableParallel, RunnablePassthrough

# Instantiate ConversationBufferMemory
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)
# First, load the memory to access chat history
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)
# Define the standalone_question step to process the question and chat history
standalone_question = {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | STANDALONE_QUESTION_PROMPT,
}
# Finally, output the populated STANDALONE_QUESTION_PROMPT
output_prompt = {
    "standalone_question_prompt_result": itemgetter("standalone_question"),
}
# Combine the steps into a final chain
standalone_query_generation_prompt = loaded_memory | standalone_question | output_prompt

The most important step here is being able to store the conversation history. Fortunately, LangChain makes it easy for us to do that.

from langchain.memory import ConversationBufferMemory

# Instantiate ConversationBufferMemory
memory = ConversationBufferMemory(
    return_messages=True, output_key="answer", input_key="question"
)
# First, load the memory to access chat history
loaded_memory = RunnablePassthrough.assign(
    chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)

So what’s happening here:

  1. The ConversationBufferMemory class is instantiated with return_messages=True, answer as the output key, and question as the input key. This sets up a memory buffer that tracks the conversation’s questions and answers.
  2. The loaded_memory step uses RunnablePassthrough.assign and a RunnableLambda to load the chat history from memory, specifically pulling out the history key, which contains the conversation’s past interactions for reference and context management.

Let’s try this out. I’m going to save a sample question and answer to illustrate how the prompt gets populated.

inputs = {"question": "how is mahomes doing?"}
memory.save_context(inputs, {"answer": "mahomes is not looking great! bench him!"})

NOTE: This saved context is what gets loaded when we run RunnableLambda(memory.load_memory_variables).
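
You can sanity-check what the memory holds at this point. With return_messages=True, the buffer comes back as message objects under the history key (the output below is illustrative):

memory.load_memory_variables({})
# {'history': [HumanMessage(content='how is mahomes doing?'),
#              AIMessage(content='mahomes is not looking great! bench him!')]}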

Now, if we invoke the chain:

inputs = {"question": "who should I replace him with?"}
standalone_query_generation_prompt.invoke(inputs)['standalone_question_prompt_result']
...

Now, with those examples, here is the actual chat history and input question.

Chat History:
Human: how is mahomes doing?
AI: mahomes is not looking great! bench him!

Follow Up Input: who should I replace him with?
Standalone question:
[your response here]
[/INST]

Great, the prompt is populated with our conversation history. Now we just need to append one more link to the standalone_question chain, the standalone_query_generation_llm, and it will generate the updated question.

standalone_query_generation_chain = loaded_memory | {
    "standalone_question": {
        "question": lambda x: x["question"],
        "chat_history": lambda x: get_buffer_string(x["chat_history"]),
    }
    | STANDALONE_QUESTION_PROMPT
    | standalone_query_generation_llm,
}

inputs = {"question": "who should I replace him with?"}
standalone_query_generation_chain.invoke(inputs)
{'standalone_question': 'Standalone Question: Who should I replace Mahomes with?'}

Perfect, now we have a query that will return relevant results.
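
One optional refinement, not part of the original chain: the model tends to echo a "Standalone Question:" prefix (as in the output above), so you may want to strip it before the text reaches the retriever. A small sketch:

# Optional clean-up step: strip the "Standalone Question:" prefix so the
# retriever receives only the rewritten question itself.
strip_prefix = RunnableLambda(lambda text: text.split("Standalone Question:")[-1].strip())

# Hypothetical placement, appended after the standalone-question LLM:
# ... | STANDALONE_QUESTION_PROMPT | standalone_query_generation_llm | strip_prefix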

Complete Chain

Fortunately, the hard part is done! We now need to take the output of the Standalone Query Generation chain and query our vector database to retrieve relevant context.

template = """
[INST]
Answer the question based only on the following context:
{context}

Question: {standalone_question}
[/INST]
"""

RESPONSE_PROMPT = ChatPromptTemplate.from_template(template)

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")
def _combine_documents(
docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
doc_strings = [format_document(doc, document_prompt) for doc in docs]
return document_separator.join(doc_strings)

# First we add a step to load memory
# This adds a "memory" key to the input object
loaded_memory = RunnablePassthrough.assign(
chat_history=RunnableLambda(memory.load_memory_variables) | itemgetter("history"),
)

# Now we calculate the standalone question
standalone_question = {
"standalone_question": {
"question": lambda x: x["question"],
"chat_history": lambda x: get_buffer_string(x["chat_history"]),
}
| CONDENSE_QUESTION_PROMPT
| standalone_query_generation_llm,
}
# Now we retrieve the documents
retrieved_documents = {
"docs": itemgetter("standalone_question") | retriever,
"standalone_question": lambda x: x["standalone_question"],
}
# Now we construct the inputs for the final prompt
final_inputs = {
"context": lambda x: _combine_documents(x["docs"]),
"standalone_question": itemgetter("standalone_question"),
}
# And finally, we do the part that returns the answers
answer = {
"answer": final_inputs | ANSWER_PROMPT | response_generation_llm,
"standalone_question": itemgetter("standalone_question"),
"context": final_inputs["context"]
}
# And now we put it all together!
final_chain = loaded_memory | standalone_question | retrieved_documents | answer

Let’s break down the added components:

# Now we retrieve the documents
retrieved_documents = {
    "docs": itemgetter("standalone_question") | retriever,
    "standalone_question": lambda x: x["standalone_question"],
}

This takes the standalone question we generated and queries the vector database. If this looks familiar, it’s because it should: it’s essentially the same retrieval step we built in the original article.

This time around, we will combine the retrieved documents into a single string that can be inserted into our prompt.

DEFAULT_DOCUMENT_PROMPT = PromptTemplate.from_template(template="{page_content}")


def _combine_documents(
    docs, document_prompt=DEFAULT_DOCUMENT_PROMPT, document_separator="\n\n"
):
    doc_strings = [format_document(doc, document_prompt) for doc in docs]
    return document_separator.join(doc_strings)

...

final_inputs = {
    "context": lambda x: _combine_documents(x["docs"]),
    "standalone_question": itemgetter("standalone_question"),
}

Now, this isn’t strictly necessary, but it is a best practice. It allows us to:

  • Include additional clean-up of the input documents.
  • Process or summarize the retrieved documents, which can come in handy to avoid overly verbose prompt strings (see the sketch below).
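
For instance, here is a hedged variant of _combine_documents that caps the combined context at a rough character budget; the limit is an arbitrary illustration and should be tuned to your model’s context window.

def _combine_documents_capped(
    docs,
    document_prompt=DEFAULT_DOCUMENT_PROMPT,
    document_separator="\n\n",
    max_chars=6000,  # arbitrary budget for illustration; tune to your context window
):
    # Same as _combine_documents, but stop adding chunks once the budget is reached.
    doc_strings = []
    total = 0
    for doc in docs:
        text = format_document(doc, document_prompt)
        if total + len(text) > max_chars:
            break
        doc_strings.append(text)
        total += len(text) + len(document_separator)
    return document_separator.join(doc_strings)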

Now we have a string version of the retrieved documents and the standalone question, ready for the response generation LLM to produce a final answer for the user. Here is the piece of the chain that generates that final response.

"answer": final_inputs | ANSWER_PROMPT | mistral_llm

To polish this up a bit more, I also include the standalone question and the retrieved context in the output dictionary. This is useful information to have on hand in a RAG application.

# And finally, we do the part that returns the answers
answer = {
    "answer": final_inputs | ANSWER_PROMPT | response_generation_llm,
    "standalone_question": itemgetter("standalone_question"),
    "context": final_inputs["context"],
}

Examples

inputs = {"question": "How is Mahomes doing?"}
result = final_chain.invoke(inputs)
result
Based on the provided context, the Fantasy Football Panic Meter grades 
Patrick Mahomes and Austin Ekeler as level 3, indicating that they are
officially panicked and their value as trustworthy starters has decreased.

Alright, now let’s ask a follow-up question that refers to him without naming him.

# Save previous question and answer to memory 
memory.save_context(inputs, {"answer": result["answer"]})

inputs = {"input_question": "Who are good alternatives to him right now?"}
result = final_chain.invoke(inputs)
result
{'standalone_question': 'Standalone Question: Who are some good alternatives
to Mahomes at quarterback right now?',
 'answer': 'Based on the provided context, some good alternatives to Mahomes
at quarterback right now include Baker Mayfield and Joe Flacco.'}

We see that “Who are good alternatives to him right now?” was translated to “Who are some good alternatives to Mahomes at quarterback right now?”, which is exactly what we want! That standalone question was then used to retrieve the relevant documents from our vector database.

So now we know that Baker Mayfield and Joe Flacco are good alternatives. Let’s keep the conversation going and ask a follow up.

memory.save_context(inputs, {"answer": result["answer"]})

inputs = {"input_question": "How many PPG are both averaging?"}
result = final_chain.invoke(inputs)
result
{'standalone_question': 'Standalone Question: What is the average points per 
game (PPG) for both Baker Mayfield and Joe Flacco?',
'answer': 'The average points per game (PPG) for Baker Mayfield is 22.9 over
his last three games. The average points per game (PPG) for
Joe Flacco is 351 pass YPG and 20.6 PPG over his last
three games.'}

Again, “How many PPG are both averaging?” was translated to “What is the average points per game (PPG) for both Baker Mayfield and Joe Flacco?”. The standalone question LLM understood who I meant by “both”. And with that, we have a RAG that you can naturally converse with!
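
If you plan to chat for many turns, the invoke-then-save_context pattern above can be wrapped in a small helper. A minimal sketch using the objects defined earlier:

def ask(question: str) -> str:
    # Run one conversational turn and persist it to memory for the next one.
    inputs = {"question": question}
    result = final_chain.invoke(inputs)
    memory.save_context(inputs, {"answer": result["answer"]})
    return result["answer"]

print(ask("How is Mahomes doing?"))
print(ask("Who are good alternatives to him right now?"))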

To recap, we further explored the practicalities of creating a conversational RAG. We learned the importance of maintaining conversation history and transforming input questions into standalone queries that effectively retrieve relevant information from vector databases. The use of two distinct LLMs, one for generating standalone queries and the other for generating responses, demonstrated a significant improvement in contextual understanding and relevance of the responses.

The step-by-step guide to building a conversational RAG highlighted the power and flexibility of LangChain in managing conversation flows and memory, as well as the effectiveness of Mistral in generating precise queries and responses. This approach ensures that each query is contextually relevant to the ongoing conversation, leading to more accurate and useful answers.

I hope you enjoyed this — let me know if you have any questions or suggestions! Please reach out to me via LinkedIn!

You can find the end to end code in this notebook.
