Get Insights from Your Business Data — Build an LLM Application with LangChain and Hugging Face Using RAG

Ashish Kumar Jain
7 min read · Aug 13, 2023

The rise of LLMs has revolutionized the industry, and everyone is thinking about how to use the power of LLMs to support their business and make their customers' lives easier. When we look at the LLM space, the models are either proprietary, like the ChatGPT family of models from OpenAI, or open source, like the FLAN-T5 family of encoder-decoder models from Google, and all of them are trained on public internet data.

The question arises: how can we use LLMs with our own business data?

Fine-Tuning vs RAG (Retrieval Augmented Generation)

There are two approaches: we can either fine-tune an LLM with our own data for a specific task (such as question answering or summarization), or we can use RAG, which incorporates our business data into the LLM's input while answering customer queries over that data. Choosing between RAG and fine-tuning depends on several factors.

Fine-tuning is a great choice when we have a large amount of task-specific labeled data and want to get insight from that data. For example, suppose we want to summarize agent-customer chats in a call center to better understand complex conversations. We can fine-tune an LLM on a chat-history dataset with labeled summaries and then run inference on real-time chat history. Fine-tuning can be computationally expensive, time consuming, and demanding on infrastructure (GPUs and memory). However, we can speed up training and reduce memory usage with methods like PEFT (Parameter-Efficient Fine-Tuning) and address computational challenges with techniques like quantization and pruning.

RAG is advantageous when we have a retrieval corpus available that covers the relevant information for the task (for example, question answering). It lets customers have a conversation with these documents and get answers to their queries from them using the LLM. For example, we may have data in a corporate wiki, on websites, or in PDFs and want to answer customer queries over these documents using LLMs. RAG is more efficient in terms of resource utilization and provides faster results, making it suitable for applications with limited compute, real-time requirements, or low-latency needs.

It is also cheaper to keep retrieval indices up to date (RAG) than to continuously retrain or fine-tune an LLM.

RAG

RAG is a framework for building LLM-powered applications that make use of external data sources outside the model and enrich the input with that data, providing richer context to improve the output. One option for getting answers from an LLM over our own data is to pass the full data in the context window along with the question we want to ask as the prompt. The problem with this approach is that LLMs are constrained by the size of the context window (around 4,096 tokens in GPT-3). Context windows are getting larger with new model releases (32,768 tokens in GPT-4), but we still cannot pass a full corpus, which may be gigabytes in size.

The intuition behind RAG is that we first run the customer query against the corpus and fetch only the relevant, much smaller pieces of information, then pass the retrieved information to the LLM in the context window along with the customer query to get the desired result. This means we need to divide the corpus into multiple chunks and store them in a form that lets us fetch the relevant chunks based on the customer's query. The best way is to convert the chunks into text embeddings and store them in a vector database. A text embedding is a compressed, abstract representation of text data in which text of arbitrary length is represented as a vector of numbers. Embeddings are usually learnt from a corpus of text data such as Wikipedia. Think of them as a universal encoding for text, where texts with similar content will have similar vectors. We can then use this vector store to find the relevant chunks by doing a similarity search on it. Finally, we use the relevant information to create a prompt with the customer query and pass that prompt to the LLM to get the desired result.
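
To make this intuition concrete, here is a minimal sketch (assuming the sentence-transformers library installed later in this post and the locally cloned all-MiniLM-L6-v2 model) showing that sentences with similar meaning get similar vectors, while an unrelated sentence scores much lower:

from sentence_transformers import SentenceTransformer, util

# Load the locally cloned sentence-transformer model (path is illustrative).
model = SentenceTransformer("/model/sentence-transformer/all-MiniLM-L6-v2")

sentences = [
    "How many weights can an LLM contain?",
    "Large language models can have billions of parameters.",
    "Our cafeteria serves lunch at noon.",
]

# Each sentence becomes a 384-dimensional vector.
vectors = model.encode(sentences)

# Cosine similarity: the first pair scores much higher than the unrelated pair.
print(util.cos_sim(vectors[0], vectors[1]))
print(util.cos_sim(vectors[0], vectors[2]))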

Implementing RAG

In this blog we will use LangChain, an excellent open source developer framework for building LLM applications. It hides much of the complexity of building LLM applications and provides very easy-to-use interfaces through its Python and JavaScript libraries. It also provides integration points with other libraries and systems for document loading, vector stores, calling various LLMs through APIs, and loading LLM models from the Hugging Face model hub. We will also use Hugging Face, a platform where the machine learning community collaborates on models, datasets, and applications. We will use Hugging Face to download one of the open source LLM models, FLAN-T5 from Google, along with the sentence-transformer model all-MiniLM-L6-v2, and get results from these models. We will load them from the local machine. You can easily download these models from Hugging Face by cloning the model repositories, which lets you run this code without internet access or in a very constrained environment. Downloading the models can take time depending on your network speed.

git lfs install 
git clone https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
git clone https://huggingface.co/google/flan-t5-large

We will use Python code for illustration purposes. You can use this code in your applications, and you can also refer to the LangChain site for more code references. You also need to install the required Python libraries below to run the code.

pip install langchain
pip install torch
pip install transformers
pip install faiss-cpu
pip install pypdf
pip install sentence-transformers

RAG Process

We can follow the process below to understand RAG.

1-) We can load the corpus from multiple sources. The corpus can be in multiple forms such as PDFs, Microsoft Word documents, or online and corporate wikis. LangChain provides different document loaders to load data from different sources, such as PDF, CSV, file directory, HTML, JSON, and Markdown.

from langchain.document_loaders import PyPDFLoader
pdfLoader = PyPDFLoader("example_data/Large_language_model.pdf")
documents = pdfLoader.load()

2-) Once the documents are loaded into memory, we can divide them into smaller chunks. It sounds easy, but it is tricky to split documents without losing the relationships between chunks. LangChain provides different types of text splitters, such as split by character, split code, MarkdownHeaderTextSplitter, recursively split by character, and split by tokens.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(documents)
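
As a quick optional check, you can inspect how many chunks were produced and preview one of them:

# Number of chunks produced from the loaded PDF.
print(len(docs))

# Preview the first 200 characters of the first chunk.
print(docs[0].page_content[:200])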

3-) We can now create embeddings from these docs. An embedding is a vector representation of a piece of text. Embeddings place every doc in a vector space, and similar docs will have similar vectors. This helps us find the docs relevant to a user query: we can easily do a semantic search (similarity search) where we look for pieces of text that are most similar in the vector space. There are many embedding model providers (OpenAI, Cohere, Hugging Face, etc.) that integrate with LangChain. We will use the sentence-transformer model all-MiniLM-L6-v2 and load it locally. This model maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.

from langchain.embeddings import HuggingFaceEmbeddings

modelPath = "/model/sentence-transformer/all-MiniLM-L6-v2"
model_kwargs = {'device': 'cpu'}
encode_kwargs = {'normalize_embeddings': False}
embeddings = HuggingFaceEmbeddings(
    model_name=modelPath,
    model_kwargs=model_kwargs,
    encode_kwargs=encode_kwargs
)

4-) A vector store provides functionality to store vector data and to run similarity searches over it. LangChain integrates with many free, open-source vector stores that can run entirely on your local machine. We will use the FAISS (Facebook AI Similarity Search) vector store from Facebook AI Research to enable efficient similarity search.

from langchain.vectorstores import FAISS
db = FAISS.from_documents(docs, embeddings)
question = "How many weights llm can contain?"
searchDocs = db.similarity_search(question)
print(searchDocs[0].page_content)
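
If you also want to see how relevant each retrieved chunk is, the FAISS store in LangChain can return distance scores alongside the documents (an optional sketch; with the default L2 distance, lower scores mean closer matches):

# Retrieve the top matching chunks together with their distance scores.
results = db.similarity_search_with_score(question, k=2)
for doc, score in results:
    print(score, doc.page_content[:100])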

5-) We need an LLM to which we can pass the similar documents along with the customer query and retrieve an answer. We will use the open source FLAN-T5-LARGE model from Hugging Face and load it locally. It is a good encoder-decoder instruct model and shows good capability on many tasks. We will also wrap this model for LangChain so it can be used with the LangChain chain API.


from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain import HuggingFacePipeline

# Load the locally cloned FLAN-T5-LARGE model and its tokenizer.
tokenizer = AutoTokenizer.from_pretrained("model/google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("model/google/flan-t5-large")

# Build a text2text-generation pipeline and wrap it for LangChain.
pipe = pipeline(
    "text2text-generation",
    model=model,
    tokenizer=tokenizer,
    max_length=512,
)
llm = HuggingFacePipeline(pipeline=pipe)
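
As a quick sanity check before building the chain, you can call the wrapped model directly (the exact answer will depend on the model):

# Ask the model a question directly, without any retrieved context yet.
print(llm("What is a large language model?"))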

6-) We can build a prompt for the LLM using a LangChain prompt template.

from langchain.prompts import PromptTemplate

template = """Use the following pieces of context to answer the question at the end. If you don't know the answer, just say that you don't know, don't try to make up an answer. Keep the answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""
QA_CHAIN_PROMPT = PromptTemplate.from_template(template)
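
To see what the final prompt will look like, you can format the template yourself with placeholder text (the values below are purely illustrative):

# Fill the template with a sample context and question to inspect the final prompt.
print(QA_CHAIN_PROMPT.format(
    context="LLMs are language models with billions of parameters.",
    question="What is an LLM?",
))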

7-) LangChain provides the Chain interface for “chained” applications. For example, we can create a chain that takes user input, formats it with a PromptTemplate, and then passes the formatted response to the LLM. We can use the LangChain RetrievalQA chain. It does all the heavy lifting for us: it retrieves relevant documents from the vector store using the vector store retriever, stuffs the docs together to fit into the LLM context, passes this information along with the user's query to the LLM using the prompt, and formats the result. Now we will get the answer from our documents using the LLM. :-)

from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=db.as_retriever(),
    chain_type_kwargs={"prompt": QA_CHAIN_PROMPT}
)
result = qa_chain({"query": question})
print(result["result"])

8-) We can expose this QA chain through an API to serve our applications.
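
For example, here is a minimal sketch of such an API using FastAPI (FastAPI is not part of the setup above, and the endpoint name and request model are illustrative assumptions):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Query(BaseModel):
    question: str

@app.post("/ask")
def ask(query: Query):
    # Run the RetrievalQA chain built in step 7 for each incoming question.
    result = qa_chain({"query": query.question})
    return {"answer": result["result"]}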

Finally, we need a way for any live updates to our corpus to trigger an event that updates the vector store with up-to-date vector indices.
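
One simple way to handle this with FAISS is to embed the newly arrived documents, add them to the existing index, and persist it to disk. This is a sketch under the assumption that updates arrive as new files (the file name below is illustrative); the trigger itself, such as a file watcher or message queue, depends on your infrastructure:

# Load, split, and embed newly arrived documents, then add them to the index.
new_documents = PyPDFLoader("example_data/new_update.pdf").load()
new_docs = text_splitter.split_documents(new_documents)
db.add_documents(new_docs)

# Persist the updated index so it can be reloaded later with FAISS.load_local.
db.save_local("faiss_index")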

I will cover fine-tuning in a future article. :-)

References —

1-) https://python.langchain.com/docs/get_started/introduction.html

2-) https://learn.deeplearning.ai/langchain-chat-with-your-data

3-) https://huggingface.co/

4-) https://eugeneyan.com/writing/llm-patterns/

5-) https://www.coursera.org/learn/generative-ai-with-llms/

