DEMO: LangChain + RAG Demo on Llama-2-7b + Embeddings Model using Chainlit

Madhur Prashant
9 min read · Sep 16, 2023


Purpose

The purpose of this blog post is to go over how you can use a Llama-2-7b model as your large language model, along with an embeddings model, to create a custom generative AI bot grounded in your own data for your use case. I will cover the main concepts, walk through the code, and then discuss what I don't like about this approach and how we can make it more optimal: responding with lower latency and without taking up CPU power on your device.

It is essential to understand that this post focuses on Retrieval Augmented Generation (RAG), LangChain, the power and scope of the Llama-2-7b model, and how we can use an embeddings model to transform sentences into embeddings. This is not an optimal approach, but we are going to use Chainlit for the interface and mostly Python as our driving stack to build a chatbot from scratch with as little code as possible.

Note: I work at Amazon Web Services, but the thoughts and opinions on these blogs are my own.

The prerequisites for following along with this walkthrough are an understanding of machine learning, data analytics concepts within artificial intelligence, Retrieval Augmented Generation, and the use of LangChain as the driving source that makes our solution generative.

To follow along with this blog, feel free to check out my other blogs to stay up to date on some of these important concepts:

https://medium.com/@madhur.prashant7/build-scalable-custom-genai-bots-retrieval-augmented-generation-langchain-on-sagemaker-mmes-489562c7a47b

https://medium.com/@madhur.prashant7/model-adaptation-fine-tuning-optimal-generative-ai-model-deployment-part-1-a7075e7ebfff

Once you have gone through these, feel free to follow along with this blog; it is going to be more of a code walkthrough and less of an in-depth discussion of the main concepts.

First, import the necessary LangChain libraries below to be able to process your data:

from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain import PromptTemplate
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.llms import CTransformers
from langchain.chains import RetrievalQA
import chainlit as cl
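
If you have not installed these packages yet, a minimal environment for this stack can be set up roughly as below. These are the usual PyPI package names for this toolchain, so treat the exact list and versions as an assumption and adjust them for your setup:

pip install langchain chainlit ctransformers sentence-transformers faiss-cpu pypdf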

Note that here we are importing a PDF as our data source. We could keep talking about how concepts like PEFT (parameter-efficient fine-tuning) or LoRA can be used to fine-tune our model, and that is true, but the best part about using RAG and LangChain is that you can upload your raw data, such as text from PDFs or medical documents, and ground your model in that. This reduces the probability that the model hallucinates, because the model depends only on the data you give it, as long as you instruct it through prompt engineering to "only answer the question based on the {context} above. If the {question} is not in the {context}, just say that you don't know the answer, don't try to make up an answer".

So as you can see above, we use a PyPDFLoader to load our PDF and a DirectoryLoader for the data directory we give the model, and we make sure to import the prompt template as well. I want to reiterate that this prompt template is essential throughout the course of working with our model. Specifically, when we are using RAG and LangChain, our model does not rely on fine-tuning methods like PEFT or QLoRA, so we need to pass in prompt templates that are sufficient to answer our questions in a way that satisfies users.

A lot of startups are adopting "AI-powered" solutions, and it is important to focus not only on tuning but also on prompt engineering. It sounds funny to be a prompt engineer, but it plays one of the biggest roles in the models of the current generative AI wave. Anyways, let's move forward:

Now, create a vector store using FAISS to build chunks and clusters for similarity search.

DB_FAISS_PATH = 'vectorstore/db_faiss'

custom_prompt_template = """Use the following pieces of information to answer the user's question.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Context: {context}
Question: {question}

Only return the helpful answer below and nothing else.
Helpful and Caring answer:
"""

Now, you will need a folder in your IDE (I used VS Code for my demo) that our model.py can refer to, so that responses are based on chunking and clustering the data we provide. It is important to understand, as covered in my previous blogs, that we cluster and chunk the data into smaller pieces so the model can use that context to do a "similarity search", extract the relevant documents based on the query, and then look for the best possible answer or prompt completion in those shortlisted chunks/relevant documents. This is why having a FAISS vector store is ESSENTIAL. Make sure to add a folder with relevant content in it.
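
To make that similarity search step concrete, here is a minimal sketch (my own illustration, not part of the original demo) of how you could load the saved FAISS index and preview which chunks get retrieved for a query; the sample question is hypothetical, and it assumes you have already run the ingestion step shown at the end of this post:

# Minimal sketch: load the FAISS index built from your document folder
# and preview the chunks retrieved for a query.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

DB_FAISS_PATH = 'vectorstore/db_faiss'  # same path as defined above

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                   model_kwargs={'device': 'cpu'})
db = FAISS.load_local(DB_FAISS_PATH, embeddings)

# Retrieve the 2 most similar chunks for a hypothetical question.
docs = db.similarity_search("What does the uploaded PDF say about pricing?", k=2)
for doc in docs:
    print(doc.page_content[:200])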

Now, you give the model the context above, just like we mentioned earlier, and with this context it will be able to give responses that refer to the template.

Most models are already trained against bias, harmful content, and illegal content, but it is important to "reinforce" the function of the model when you are using it, so having that stated in the prompt makes it act as a helpful, harmless, honest, and non-hallucinating assistant.

def set_custom_prompt():
    """
    Prompt template for QA retrieval for each vectorstore
    """
    prompt = PromptTemplate(template=custom_prompt_template,
                            input_variables=['context', 'question'])
    return prompt

# Retrieval QA Chain
def retrieval_qa_chain(llm, prompt, db):
    qa_chain = RetrievalQA.from_chain_type(llm=llm,
                                           chain_type='stuff',
                                           retriever=db.as_retriever(search_kwargs={'k': 2}),
                                           return_source_documents=True,
                                           chain_type_kwargs={'prompt': prompt}
                                           )
    return qa_chain

Here, make sure to set your own custom prompt and then have the chain retrieve data based on the vector store. The retrieval_qa_chain function is essential because it chains the documents and the relevant response types to give you a constructive answer. Given below are two examples from my own GitHub, with and without using the retrieval QA chain:

Without chaining:

As you can see, it gives you a response that is sufficient, but it is not chained properly. It does not efficiently chain the response from the relevant documents, and that is why we use the retrieval QA chain: to retrieve the relevant documents and then chain a proper response based on the query that is provided. Let's look at an example of when we use it, and the kind of response we get.

With Retrieval QA Chain:

As seen from the document above, it chains the relevant information from the documents that are extracted from the vector store and therefore gives an optimal answer based on that.

Load the LLM that you are attempting to use and initialize it:

# Loading the model
def load_llm():
    # Load the locally downloaded model here
    llm = CTransformers(
        model="meta/Llama-2-7B-Chat-GGML",
        model_type="llama",
        max_new_tokens=512,
        temperature=0.5
    )
    return llm

Here, we load the model we are using, which in this case is the Llama-2-7b chat model from Meta. This LLM works efficiently and pairs well with the embeddings model that we are going to use and show the code for.

Make sure to set the parameters and their values based on your specific use case. Let's take an example: if you are running a medical device company that also provides medical solutions, you want the information to be used in a clear-cut and accurate fashion, and for that you want to set the temperature to 0 to return a deterministic response that yields the same completion every time. In this case, you don't want to give users different responses, especially for a medicines use case. On the other hand, if your use case can tolerate different completions for the user, such as general text/content generation, you can relax these parameters.
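
As a rough illustration of how those parameter choices play out, the two configurations below (illustrative values of my own, not from the demo) contrast a deterministic, medical-style setup with a more creative content-generation setup:

# Illustrative parameter choices only; tune them for your own use case.
deterministic_llm = CTransformers(
    model="meta/Llama-2-7B-Chat-GGML",
    model_type="llama",
    max_new_tokens=512,
    temperature=0.0   # same completion every time, e.g. medical answers
)

creative_llm = CTransformers(
    model="meta/Llama-2-7B-Chat-GGML",
    model_type="llama",
    max_new_tokens=512,
    temperature=0.9   # more varied completions, e.g. general content generation
)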

Creating the embeddings model, using QA chaining, and returning the final result function:

# QA Model Function
def qa_bot():
    embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                       model_kwargs={'device': 'cpu'})
    db = FAISS.load_local(DB_FAISS_PATH, embeddings)
    llm = load_llm()
    qa_prompt = set_custom_prompt()
    qa = retrieval_qa_chain(llm, qa_prompt, db)
    return qa

# Output function
def final_result(query):
    qa_result = qa_bot()
    response = qa_result({'query': query})
    return response

In these functions, we first use our embeddings model with the FAISS vector store to chunk the documents sentence by sentence and then bring them together based on their relevance to the prompt that is supplied to the LLM. We will be using the

sentence-transformers/all-MiniLM-L6-v2

model to make it easier to create embeddings. It will take a little longer to create embeddings for your documents, but it gives accurate responses, and this model pairs well with the large language model that we are using. Note that we talked about using this in one of my previous blogs on Amazon SageMaker Studio.
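
As a quick sanity check (a sketch of my own, not part of the demo), you can embed a single sentence with this model and look at the vector it produces; all-MiniLM-L6-v2 returns 384-dimensional embeddings:

# Sketch: embed one sentence and inspect the resulting vector.
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2",
                                   model_kwargs={'device': 'cpu'})

vector = embeddings.embed_query("Retrieval Augmented Generation grounds the model in your data.")
print(len(vector))  # 384 dimensions for all-MiniLM-L6-v2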

Lastly, once we receive the chained result, we pass it through the final_result function and return it based on the query that is given to the LLM.
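
For a quick test outside of Chainlit, you could call final_result directly from a Python shell (the question below is just a hypothetical example):

# Quick test of the pipeline before wiring up the Chainlit UI.
response = final_result("What are the key points covered in the uploaded PDF?")
print(response['result'])            # the generated answer
print(response['source_documents'])  # the chunks the answer was grounded on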

Using Chainlit to launch our application:

# Chainlit code
@cl.on_chat_start
async def start():
    chain = qa_bot()
    msg = cl.Message(content="Starting your gen AI bot!...")
    await msg.send()
    msg.content = "Welcome to Demo Bot! Ask your question here:"
    await msg.update()
    cl.user_session.set("chain", chain)

@cl.on_message
async def main(message):
    chain = cl.user_session.get("chain")
    cb = cl.AsyncLangchainCallbackHandler(
        stream_final_answer=True, answer_prefix_tokens=["FINAL", "ANSWER"]
    )
    cb.answer_reached = True
    res = await chain.acall(message, callbacks=[cb])
    answer = res["result"]
    sources = res["source_documents"]
    if sources:
        answer += "\nSources:" + str(sources)
    else:
        answer += "\nNo sources found"
    await cl.Message(content=answer).send()

Here is where we put all of our pieces together and create a bot that is launched on Chainlit.

“Chainlit is an open source Python / Typescript library that allows developers to create ChatGPT like user interfaces quickly. It allows you to create a chain of thoughts and then add a pre-built, configurable chat user interface to it. It is excellent for web based chatbots.”

Here, you don't only have to use RAG or LangChain; if you are looking to build a prototype, you can fine-tune your model, containerize it in Docker, and then launch it easily on Streamlit or Chainlit without going through the hassle of building your own interface.

INGEST YOUR DATA! Don’t Forget

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import PyPDFLoader, DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

DATA_PATH = 'demodataPDFs/'
DB_FAISS_PATH = 'vectorstore/db_faiss'

# Create vector database
def create_vector_db():
    loader = DirectoryLoader(DATA_PATH,
                             glob='*.pdf',
                             loader_cls=PyPDFLoader)
    documents = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500,
                                                   chunk_overlap=50)
    texts = text_splitter.split_documents(documents)
    embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                       model_kwargs={'device': 'cpu'})
    db = FAISS.from_documents(texts, embeddings)
    db.save_local(DB_FAISS_PATH)

if __name__ == "__main__":
    create_vector_db()

Now, as the code suggests, we are ingesting the PDF data that we supply to the code through these folders:

DATA_PATH = 'demodataPDFs/'
DB_FAISS_PATH = 'vectorstore/db_faiss'

Note that you can reuse this ingestion step when you are fine-tuning your own model too. Once you have used the embeddings model here to create embeddings for the documents you supply, and the FAISS vector store to divide your data into chunks, you can run the model:

chainlit run demomodel.py -w

Your model should then launch on Chainlit, where you can debug and troubleshoot; if you face errors, they will pop up before the website properly loads.

Conclusion

This was the easiest way to deploy your own customized chatbot using RAG and LangChain so that you can test out your prototype. In the next blog, I plan to host a model that is harmful and show how we can take the toxicity out of it. We will also talk about startups and how AI-powered solutions are changing the generative AI wave.

You can optimize this model by using different services and make this entire process much better using Amazon Bedrock 👀, but we will save that for when it launches and is publicly accessible.

Make sure to follow me on LinkedIn for more: https://www.linkedin.com/in/madhur-prashant-781548179/


Madhur Prashant

Learning is my passion, so is the intersection of technology & strategy. I am passionate about product. I work @ AWS but these are my own personal thoughts!