Building a Document-based Question Answering System with LangChain using an Open Source LLM (LLaMA 2)

Nagesh Mashette
4 min read · Aug 28, 2023


In this post, we build a document question-answering system using LLaMA 2 as the LLM and FAISS as the vector store, tied together with the LangChain framework.

The LLaMA 2 models are pretrained on 2 trillion 🚀 tokens and range from 7 to 70 billion parameters, which makes LLaMA 2 one of the most powerful open source model families. It comes in three model sizes (7B, 13B, and 70B) with significant improvements over the LLaMA 1 models, including being trained on 40% more tokens, having a much longer context length (4k tokens 🤯), and using grouped-query attention for fast inference of the 70B model 🔥. It outperforms other open source LLMs on many external benchmarks, including reasoning, coding, proficiency, and knowledge tests.

Figure: LLaMA 1 vs. LLaMA 2 benchmarks

FAISS (Facebook AI Similarity Search) is a library for efficient similarity search and clustering of dense vectors. It makes it possible to search collections of embedded documents, such as images or text, in ways that are impractical for conventional database engines like SQL. FAISS provides algorithms that search sets of vectors of any size, including collections too large to fit in RAM, and it also includes supporting code for evaluation and parameter tuning.
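FAISS is easy to try on its own before we wire it into LangChain. Here is a minimal sketch, independent of the rest of this post and using random vectors purely for illustration, that builds an exact L2 index and queries it:

import faiss
import numpy as np

d = 128                                              # vector dimensionality
xb = np.random.random((1000, d)).astype('float32')   # 1,000 "database" vectors
index = faiss.IndexFlatL2(d)                         # exact (brute-force) L2 index
index.add(xb)

xq = np.random.random((5, d)).astype('float32')      # 5 query vectors
distances, ids = index.search(xq, 4)                 # 4 nearest neighbours per query
print(ids.shape)                                     # (5, 4)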

Q&A Flow:

Figure: LLaMA 2 Q&A flow

Code Explanation:

In this section, I will go through the code and explain each step in detail.

Getting Started

You can use the open source Llama-2-7b-chat model in both Hugging Face Transformers and LangChain. However, you first have to request access to the Llama 2 models via the Meta website and also agree to share your account details with Meta on the Hugging Face website. It typically takes anywhere from a few minutes to a few hours to get access. LLaMA 2 comes in both base models and chat models.

Figure: LLaMA 2 models

Note: your Hugging Face account email MUST match the email you provided on the Meta website, or your request will not be approved.

If you’re using Google Colab to run the code, go to Runtime > Change runtime type > Hardware accelerator > GPU > GPU type > T4 in your notebook. You will need ~8GB of GPU RAM for inference; running on CPU is practically impossible.
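Before loading anything heavy, it is worth confirming that the GPU is actually visible. A quick optional check:

import torch

print(torch.cuda.is_available())        # should print True on a GPU runtime
print(torch.cuda.get_device_name(0))    # e.g. 'Tesla T4'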

In the code below, everything we use is open source (i.e., the LLM, the embedding model, and the vector store).

import os
import textwrap

from huggingface_hub import notebook_login
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

from langchain import HuggingFacePipeline
from langchain.chains import RetrievalQA
from langchain.document_loaders import UnstructuredFileLoader
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS

Initialize Hugging Face authentication using your access token.

os.environ['HUGGINGFACEHUB_API_TOKEN'] = 'your_access_token_here'
# Alternatively, log in interactively (handy in notebooks):
# notebook_login()
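As an optional sanity check, huggingface_hub can report which account the token belongs to. Note that whoami does not read the LangChain-specific variable set above, so we pass the token explicitly:

from huggingface_hub import whoami

print(whoami(token=os.environ['HUGGINGFACEHUB_API_TOKEN'])['name'])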

The following code reads an unstructured PDF document named ‘file.pdf’, splits its content into smaller overlapping text chunks, and prepares them for embedding.

loader = UnstructuredFileLoader('file.pdf')
documents = loader.load()

text_splitter = CharacterTextSplitter(separator='\n',
                                      chunk_size=1000,
                                      chunk_overlap=50)
text_chunks = text_splitter.split_documents(documents)
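A quick optional check: inspect how many chunks were produced and preview the first one.

print(len(text_chunks))                    # number of chunks created
print(text_chunks[0].page_content[:200])   # first 200 characters of chunk 1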

We use an open source embedding model from Hugging Face; you can substitute any other sentence-embedding model from the Hugging Face Hub.

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2',
                                   model_kwargs={'device': 'cuda'})
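To confirm the embedding model loads correctly, you can embed a sample sentence; all-MiniLM-L6-v2 produces 384-dimensional vectors:

sample_vector = embeddings.embed_query("What is this document about?")
print(len(sample_vector))   # 384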

Next, store the embedded chunks in a FAISS vector store.

vectorstore = FAISS.from_documents(text_chunks, embeddings)
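Two optional extras: you can persist the index so the document is not re-embedded on every run, and you can test retrieval directly without involving the LLM. The "faiss_index" path here is just an example name:

vectorstore.save_local("faiss_index")
# Later, reload it with: vectorstore = FAISS.load_local("faiss_index", embeddings)

docs = vectorstore.similarity_search("your test question", k=3)
print(docs[0].page_content[:200])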

Now load the LLM from the Hugging Face Hub. Here I am using meta-llama/Llama-2-7b-chat-hf.

import torch

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    device_map='auto',
    torch_dtype=torch.float16,
    use_auth_token=True,    # requires approved access to the gated Llama 2 repo
    load_in_8bit=True,      # 8-bit quantization; needs the bitsandbytes package
    # load_in_4bit=True,    # alternative: even lighter 4-bit quantization
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,    # dtype is already fixed by the model loaded above
    device_map="auto",
    max_new_tokens=1024,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

llm = HuggingFacePipeline(pipeline=pipe, model_kwargs={'temperature': 0})
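Before wiring the model into a chain, you can sanity-check generation directly; the prompt here is arbitrary:

print(llm("Explain what a vector database is in one sentence."))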

Next, initialize the RetrievalQA chain. This chain gives you a chatbot-style interface while relying on the vector store to find relevant passages from your document.

Additionally, you can return the source documents used to answer the question by passing the optional parameter return_source_documents=True when constructing the chain.

chain = RetrievalQA.from_chain_type(llm=llm,
                                    chain_type="stuff",
                                    return_source_documents=True,
                                    retriever=vectorstore.as_retriever())
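By default the retriever returns the four most similar chunks. If you want to tune that, pass search_kwargs when creating the retriever (k=3 here is just an example) and use it in from_chain_type above:

retriever = vectorstore.as_retriever(search_kwargs={'k': 3})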

Finally, you can ask questions about your document; the LLM answers based on the context retrieved from the vector database.

query = "question?"
result = chain({"query": query}, return_only_outputs=True)
wrapped_text = textwrap.fill(result['result'], width=500)
print(wrapped_text)
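Because the chain was built with return_source_documents=True, the retrieved chunks come back alongside the answer, which is handy for checking where an answer came from:

for doc in result['source_documents']:
    print(doc.metadata)                 # e.g. the source file name
    print(doc.page_content[:100])       # preview of the supporting chunk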

In conclusion, LangChain question answering powered by the open source Llama 2 model from Meta AI is a versatile tool for accurate and contextually rich question answering over your own documents. By embracing open-source principles, this stack makes advanced language technology widely accessible, with the potential to reshape how we interact with information. While challenges remain, including model refinement and ethical considerations, this project highlights the synergy between AI advancement and human-driven innovation in language understanding. I extend my heartfelt thanks for joining me on this exploration, and I look forward to witnessing the continued fusion of AI and human ingenuity. Thanks for reading!

