Build Scalable Custom GenAI Bots: Retrieval Augmented Generation + Langchain on SageMaker MMEs

Madhur Prashant
13 min read · Sep 5, 2023


Purpose

With the launch of so many generative AI services and products, it is tough to keep up to speed with building your own solution, or to stop asking yourself, "what is the right way to do this job?" It turns out there are many right ways; you just have to select the one that is most optimal for your specific use case.

The wave of AI has brought with it the rise of Large Language Models (LLMs). For those who don't know, these are powerful tools that generate human-like text in the scenarios you point them at, but they have limitations, such as the model's memory and compute footprint, its latency, and the amount of data it can reason over. For the sake of this blog, we will focus on building a bot grounded in your own customized dataset using the principles of Retrieval Augmented Generation (RAG) and Langchain, pairing an embedding model with a Large Language Model to converse and generate text for a specific scenario (in this case, we will use the LLaMa-2-7b model: https://huggingface.co/meta-llama/Llama-2-7b).

When building our own LLM-powered solution from scratch, we can use several techniques, such as a KNN search over embeddings, or training a model from scratch by splitting our data into the traditional training, validation, and test sets. Much of the future of large language models depends on creating a vector store holding embeddings of chunks of your data, so in this blog we will lay out a product strategy for an LLM-powered bot along with an end-to-end code walkthrough that you can use in your own environment to create a customized solution of your own choice.

Before reading further, note that I will not be explaining the basics of machine learning and AI, so be sure to know the machine learning lifecycle and a little about what generative AI is, and ideally try out a large language model, like GPT, before creating your own.

We will walk through the product strategy of a finance business that wants an LLM able to answer questions about the financial data commonly requested by executives and other employees in the company, along with another LLM specifically for customers who interact closely with the staff. Now let's get started.

Take a look: https://github.com/madhurprash/SageMaker-ML-Projects/tree/main

Note: I work at AWS, but the thoughts in these blogs are my own.

Retrieval Augmented Generation (RAG) in LLMs

Before we get started, here is the machine learning lifecycle for traditional model training, deployment, and inference. We first focus on understanding the core use of the product, our business idea and pain points, followed by what we need to accomplish. Based on our goals, we pick a data source, aggregate a large amount of data, and then make sure whatever we train the model on is cleaned and normalized, using whatever preprocessing techniques and machine learning algorithms that requires.

After this, we train the model we have decided to use on that data, evaluate it using metrics (check my previous blog posts for more information on evaluation metrics) and feature engineering, and then deploy it. We repeat the cycle, making sure the model is trained on the data we need and that we evaluate and iterate on it so that the responses, or prompt completions, the model outputs are accurate, concise, and fit the needs of the customer and the business/product.

Sometimes, parsing and splitting the data and training the model on it is not enough to generate outputs that are accurate and sufficient for users. This is where Retrieval Augmented Generation (RAG) fits in.

LLMs are models with billions and billions of parameters, which makes it essential to evaluate the model's responses whether we are using a pre-trained model or fully fine-tuning one. Retrieval Augmented Generation (RAG) improves the quality of LLM-generated completions by grounding the model on external resources that act as its sources of knowledge, shaping how the LLM represents the information it is given and asked about. RAG also ensures the model uses external, trusted, reliable data that can be changed easily, so our LLM provides trustworthy responses in real time. It solves two main problems: the LLM having 'no source' and the LLM being 'out of date'.

With this setup, the large language model goes over the given prompt, and we not only get a completion or response to the prompt but also the sources from which the LLM drew its answer. RAG helps augment our data sources with new information, so that when a user asks a question, the model is ready.

Two more main pointers: first, Retrieval Augmented Generation (RAG) restricts the model to the primary sources of information when answering a prompt, which avoids hallucinations; second, it helps the model tell the user 'I don't know' when the answer is not provided in any of the augmented sources. This makes the model more reliable. We could keep going on and on about RAG, but let's look at a quick snippet:

query = "Which is the fourth country to land on moon?"
result = qa_chain({"query": query})

print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

Here, the query runs through a RetrievalQA chain, which was created like this:

qa_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

With this chain, we make sure the LLM refers correctly to the documents it takes in as context (which we can augment and keep updated so the model stays current) and returns its response along with the sources it referred to. This keeps the model reliable, up to date, and grounded in the source data it uses to generate completions to prompts.
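Each entry in result["source_documents"] is a LangChain Document whose metadata (populated by loaders such as PyPDFLoader) typically records the file and page it came from. Here is a small, optional sketch (assuming the result object from the snippet above) of surfacing those as citations:

## Optional: list the file and page each answer was grounded on, so users can verify it
for doc in result["source_documents"]:
    source = doc.metadata.get("source", "unknown")  # e.g. the PDF path passed to the loader
    page = doc.metadata.get("page", "n/a")          # page index, when the loader provides one
    print(f"Cited: {source} (page {page})")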

Now, let's take a look at what Langchain is, how it fits in with Retrieval Augmented Generation (RAG), and how we can use the two together with efficient embedding models and large language models to act as a customized LLM for your business.

Langchain + Huggingface: Simplifying the Creation of Large Language Models (LLMs)

Langchain is a framework that aids in creating LLM-based applications and conversational systems in an efficient and structured manner. It is worth understanding if you want a uniform, standardized approach to implementing LLMs in your applications. Langchain, in combination with Hugging Face (think of it as a GitHub for models, hosting well over 120k of them), can support a nearly unlimited range of LLM use cases. Standardizing your LLM implementation with Langchain and Hugging Face models covers use cases such as:

  1. Document processing and analysis.
  2. Building chatbots that interact with users naturally and in a human-centric manner.
  3. Pairing Langchain with newer models, such as CodeLLaMa, to help write end-to-end code for your software application.

There are several other use cases, such as text generation, summarization, and data augmentation, but here we will walk through an end-to-end Question Answering (QA) LLM that personalizes our product using a number of data sources, retrieving information from them in an augmented manner with Retrieval Augmented Generation (RAG).

Before moving forward, make sure you have installed and set up your environment with langchain:

!pip install langchain
## Any LangChain LLM wrapper works here; as an example, a Hugging Face Hub model
from langchain.llms import HuggingFaceHub
llm = HuggingFaceHub(
    repo_id="google/flan-t5-xl",           # example model id - swap in your own
    huggingfacehub_api_token=api_token,    # your Hugging Face API token
    model_kwargs={"temperature": 0.9},
)
## Then, you can use langchain to create a chain and Retrieval
## Augmented Generation (RAG) to fetch and respond using the
## relevant documents
response=llm.predict("How to cook Saag Paneer?")
print(response)

To sum up: we will use Langchain in combination with Retrieval Augmented Generation (RAG) to create a customized QA bot using an embedding model and a Large Language Model, LLaMa-2-7b. We will talk about both as we do the code walkthrough.

Code Walkthrough + Product Implementation

STEP 1: INSTALL THE ENVIRONMENT

%pip install langchain==0.0.251 --quiet --root-user-action=ignore
%pip install faiss-cpu==1.7.4 --quiet --root-user-action=ignore
%pip install pypdf==3.15.1 --quiet --root-user-action=ignore

Note: you may need to restart the kernel to use updated packages.

Here, make sure to set up the environment with langchain, FAISS, and pypdf:

  1. Langchain: open-source framework to build LLM-powered applications.
  2. FAISS (Facebook AI Similarity Search): a library that allows developers to quickly search for embeddings of multimedia documents that are similar to each other. It solves limitations of traditional query search engines that are optimized for hash-based searches, and provides more scalable similarity search functions.
  3. PyPDF: PDF loader library for our environment to process our data in the form of PDFs.

STEP 2: FETCHING AND PROCESSING YOUR CUSTOMIZED DATA (PDF files in this blog)

## Here, we list the file names that you will be creating and storing in your
## directory in whatever environment you are using.
filenames = [
    'OurBlogDataForInference.pdf',
]

## Make sure your data is located in this location
data_root = "./data/"

## Importing the libraries to load the PDF and recursively split the
## characters into chunks to store in our vector database
import numpy as np
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import PyPDFLoader

documents = []

## Iterating through all of the files in filenames (in case you have more than
## one file - this makes the approach scalable)
for filename in filenames:
    ## Loading the PDFs
    loader = PyPDFLoader(data_root + filename)
    loaded_documents = loader.load()   # Use a variable to store loaded documents
    documents.extend(loaded_documents) # Extend the list with loaded documents

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=100,
)
docs = text_splitter.split_documents(documents)
print(f'Number of Document Pages: {len(documents)}')
print(f'Number of Document Chunks: {len(docs)}')

Output:
Number of Document Pages: 28
Number of Document Chunks: 170

Now that we have processed the document data, let's work with the models: first to embed the document chunks into a vector store, so that RAG can retrieve the contextually relevant blog-data documents.
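As an optional sanity check before embedding anything (a minimal sketch; docs and documents are the lists created above), you can eyeball one chunk and its metadata:

## Peek at the first chunk and its metadata before creating any embeddings
print(docs[0].page_content[:300])  # first 300 characters of the first chunk
print(docs[0].metadata)            # e.g. {'source': './data/OurBlogDataForInference.pdf', 'page': 0}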

Deploying the Models: all-MiniLM-L6-v2 for Embeddings and LLaMa-2-7b-chat for our LLM

!pip install -qU \
sagemaker \
pinecone-client==2.2.1 \
ipywidgets==7.0.0
## To begin, we will initialize all of the SageMaker session variables we'll 
## need to use throughout the walkthrough.
import sagemaker
## Importing the sagemaker jumpstart model for the LLM that we will be using
from sagemaker.jumpstart.model import JumpStartModel
## Importing Huggingface for a model that we will use to first, create embeddings
## for the data we have loaded, and secondly, for the LLM to converse with users
from sagemaker.huggingface import HuggingFaceModel
role = sagemaker.get_execution_role()
my_model = JumpStartModel(model_id = "meta-textgeneration-llama-2-7b-f")

Deploying the model endpoint for the Sentence Transformer embedding model

First up, we will deploy an embeddings model; embeddings make it easier to do machine learning on large and sparse data. They let the model make comparisons and analyses, and using a dedicated model just for this lets us turn our large amount of data into embeddings that yield more contextually aligned answers, since the data is stored and referenced as vectors.

from sagemaker.jumpstart.model import JumpStartModel

embedding_model_id, embedding_model_version = "huggingface-textembedding-all-MiniLM-L6-v2", "*"
model = JumpStartModel(model_id=embedding_model_id, model_version=embedding_model_version)
embedding_predictor = model.deploy()

embedding_model_endpoint_name = embedding_predictor.endpoint_name
embedding_model_endpoint_name

import boto3
aws_region = boto3.Session().region_name
print(aws_region)

Output:
us-east-1

Creating and Populating our Vector Database

In our case, since we are working with large amounts of data and need to match the contextually relevant chunks we processed to the question asked by the user, we use a vector database, here FAISS, to represent the chunks as vectors. When the user prompts the system, the prompt is embedded into a vector that is matched against the chunk vectors, returning the most viable, contextually aligned, and accurate answers.

from typing import Dict, List
from langchain.embeddings import SagemakerEndpointEmbeddings
from langchain.embeddings.sagemaker_endpoint import EmbeddingsContentHandler
import json

## Represents a custom embeddings content handler to handle the prompts and outputs
class CustomEmbeddingsContentHandler(EmbeddingsContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, inputs: List[str], model_kwargs: Dict) -> bytes:
        input_str = json.dumps({"text_inputs": inputs, **model_kwargs})
        return input_str.encode("utf-8")

    def transform_output(self, output: bytes) -> List[List[float]]:
        response_json = json.loads(output.read().decode("utf-8"))
        embeddings = response_json.get("embedding", [])  # Use get() with a default value
        return embeddings  # Make sure to return the embeddings

## Creating an embeddings object to be able to invoke the endpoint easily
embeddings_content_handler = CustomEmbeddingsContentHandler()
embeddings = SagemakerEndpointEmbeddings(
    endpoint_name=embedding_model_endpoint_name,
    region_name=aws_region,
    content_handler=embeddings_content_handler,
)
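As a quick check that the endpoint and content handler are wired up correctly (a minimal sketch; the sentence is arbitrary), you can embed a single string and inspect the vector that comes back:

## Embed one sentence and confirm we get a single fixed-length vector back
sample_vector = embeddings.embed_query("What is the financial status of product XYZ?")
print(len(sample_vector))  # all-MiniLM-L6-v2 returns 384-dimensional embeddings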

Now, with our embeddings endpoint in place, we can turn our document chunks into vectors and store them. Our project will use:

FAISS: In-Memory vector database

from langchain.schema import Document
from langchain.vectorstores import FAISS
## Now, we store all the docs and embeddings in our database
db = FAISS.from_documents(docs, embeddings)
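Since FAISS here is an in-memory store, the index disappears when the notebook stops. If you would rather not re-embed everything on every run, LangChain's FAISS wrapper can persist the index to disk (a minimal sketch; the folder name is arbitrary):

## Persist the index locally so the chunks don't have to be re-embedded every run
db.save_local("faiss_index")
## ...and reload it later with the same embeddings object
db = FAISS.load_local("faiss_index", embeddings)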

NOW, RUNNING VECTOR QUERIES!!

query = "What is the financial status of product XYZ?"
results_with_scores = db.similarity_search_with_score(query)
for doc, score in results_with_scores:
print(f"Content: {doc.page_content}\nScore {score}\n\n")

Output: the matching document chunks are printed with their similarity scores (e.g., Score: 0.7269995808601379).

Running these vector queries alone returns raw chunks rather than polished answers, and the results are not always the most accurate; so we will use RAG and Langchain to get appropriate and accurate responses, as shown below:

PROMPT ENGINEERING FOR CUSTOM DATA

from langchain.prompts import PromptTemplate
prompt_template = """
<s>[INST] <<SYS>>
Use the context provided below to answer the question at the end. If you don't know the answer, please state that you don't know and do not attempt to make up an answer.
<</SYS>>
Context:
----------------
{context}
----------------
Question: {question} [/INST]
"""
PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)
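To see exactly what string the LLM will receive, you can render the template with placeholder values (a quick sketch; the context and question below are made up):

## Render the template to inspect the final prompt that will be sent to LLaMa-2
print(PROMPT.format(
    context="Product XYZ revenue grew steadily over the last quarter.",
    question="What is the financial status of product XYZ?"
))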

Now that we have defined what our prompt template is going to look like, we will create and prepare our LLM.

PREPARING OUR CUSTOM LLM

from typing import Dict
from langchain import SagemakerEndpoint, PromptTemplate
from langchain.llms.sagemaker_endpoint import LLMContentHandler
from langchain.chains.question_answering import load_qa_chain
from langchain.chains import RetrievalQA
import json

class QAContentHandler(LLMContentHandler):
    content_type = "application/json"
    accepts = "application/json"

    def transform_input(self, prompt: str, model_kwargs: dict) -> bytes:
        input_str = json.dumps({
            "inputs": [
                [
                    {"role": "system", "content": ""},
                    {"role": "user", "content": prompt}
                ]
            ],
            "parameters": {**model_kwargs}
        })
        return input_str.encode('utf-8')

    def transform_output(self, output: bytes) -> str:
        response_json = json.loads(output.read().decode("utf-8"))
        return response_json[0]["generation"]["content"]

qa_content_handler = QAContentHandler()

Now that we have our content handler, we will deploy a SageMaker endpoint for our Large Language Model, which will work alongside the embedding model to generate outputs.

LLaMa-2–7b-f LLM for our CUSTOM DATASET

# from sagemaker.jumpstart.model import JumpStartModel
llm_model_id, llm_model_version = "meta-textgeneration-llama-2-7b-f", "*"
llm_model = JumpStartModel(model_id=llm_model_id, model_version=llm_model_version)
llm_predictor = llm_model.deploy(
    initial_instance_count=1, instance_type="ml.g5.4xlarge"
)

llm_model_endpoint_name = llm_predictor.endpoint_name
llm_model_endpoint_name

## Creating the LLM object for easy invocations from the model endpoint
## to generate inference using the RAG documents
llm = SagemakerEndpoint(
    endpoint_name=llm_model_endpoint_name,
    region_name=aws_region,
    model_kwargs={"max_new_tokens": 1000, "top_p": 0.9, "temperature": 1e-11},
    endpoint_kwargs={"CustomAttributes": "accept_eula=true"},
    content_handler=qa_content_handler
)

Now, we can use our 'llm' object to send queries to the endpoint and get predictions:

query = "Hello"
llm.predict(query)
" Hello! It's nice to meet you. Is there something 
I can help you with or would you like to chat?"
query = "What is Financial Status of Product XYZ?"
llm.predict(query)
" Hello! It's nice to meet you. The financial status of product
XYZ is economically stable and steady with a little bit of an
exponential growth the last month compared to the past 3 years."

Not a bad answer, but now we will create a Langchain chain using RetrievalQA, which will:

  1. Take a query as input
  2. Generate query embeddings
  3. Query the vector database for relevant chunks from the knowledge you supply
  4. Inject the context and original query in the Prompt Template
  5. Invoke the LLM with a completed prompt and
  6. Successfully get the LLM response/completion:
qa_chain = RetrievalQA.from_chain_type(
    llm,
    chain_type='stuff',
    retriever=db.as_retriever(),
    return_source_documents=True,
    chain_type_kwargs={"prompt": PROMPT}
)

Now that our chain has been created, we can supply queries to it and generate responses based on our source documents:

query = "What is Financial Status of Product XYZ?"
result = qa_chain({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

response: ……

query = "What are the number of customers acquired last month?"
result = qa_chain({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

response: ……..

query = "What are the risks of launching this Product in India?"
result = qa_chain({"query": query})
print(f'Query: {result["query"]}\n')
print(f'Result: {result["result"]}\n')
print(f'Context Documents: ')
for srcdoc in result["source_documents"]:
    print(f'{srcdoc}\n')

response: "I am sorry, I don't know about this information based
on the context provided"

This is important and shows that, using RAG and Langchain, the model will ONLY provide answers based on the data you provide it and the context you tell it to rely on. If you clarify in the prompt that the model should NOT answer when it DOES NOT KNOW the answer, it will not make up an answer.

CLEAN UP YOUR ENDPOINT!

Make sure to clean up your endpoints to stop them from incurring charges!

# sagemaker_client = boto3.client('sagemaker', region_name=aws_region)
# sagemaker_client.delete_endpoint(EndpointName=embedding_model_endpoint_name)
# sagemaker_client.delete_endpoint(EndpointName=llm_model_endpoint_name)
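If you still have the predictor objects from earlier in the notebook, the SageMaker Python SDK can do the same cleanup (and remove the backing models and, by default, the endpoint configurations too); a small sketch, kept commented out like the cell above:

# embedding_predictor.delete_model()
# embedding_predictor.delete_endpoint()
# llm_predictor.delete_model()
# llm_predictor.delete_endpoint()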

Conclusion

With the above code, you can deploy your own LLM-powered application grounded in a customized dataset using Retrieval Augmented Generation (RAG), Langchain, and Hugging Face. This leads to highly reliable, augmented, real-time, and accurate responses for your users. Thank you for reading; the next blog will dive deeper into some of these concepts and applications. It is also worth noting that if your LLM performs several different tasks, or your data goes beyond what can be processed manually, you can create a serial inference pipeline on SageMaker where each container performs a different task, such as data preprocessing, model inference, and data postprocessing. You can also deploy several models for several different use cases on SageMaker Multi-Model Endpoints in an auto-scalable manner.
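As a rough illustration of that last point, here is a minimal sketch, not part of the walkthrough above: the endpoint name, S3 prefix, and target model path are hypothetical, the earlier llm_model object is reused only as an example container definition, and hosting large LLMs behind MMEs has its own sizing and container considerations. It shows how the SageMaker SDK's MultiDataModel lets several model artifacts share one endpoint:

import sagemaker
from sagemaker.multidatamodel import MultiDataModel

## One endpoint serving many model artifacts stored under a shared S3 prefix.
## The names and paths below are hypothetical placeholders.
mme = MultiDataModel(
    name="genai-bots-mme",
    model_data_prefix="s3://my-bucket/genai-bots/",  # each bot's model.tar.gz lives under this prefix
    model=llm_model,                                 # reuses this Model's container image and role (illustrative)
    sagemaker_session=sagemaker.Session(),
)
mme_predictor = mme.deploy(initial_instance_count=1, instance_type="ml.g5.4xlarge")

## Route a request to one specific model hosted behind the shared endpoint
payload = {"inputs": "What is the financial status of product XYZ?"}
mme_predictor.predict(data=payload, target_model="finance-bot/model.tar.gz")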


Madhur Prashant

Learning is my passion, so is the intersection of technology & strategy. I am passionate about product. I work @ AWS but these are my own personal thoughts!