Building custom question-answering app using LangChain and Pinecone vector database

Avikumar Talaviya

Published in

AI Science

10 min read · Aug 29, 2023

Build a custom chatbot to develop Q&A applications from any data sources using LangChain, OpenAI, and PineconeDB

Image credits: Mirantha Jayathilaka, PhD

Introduction

The advent of large language models (LLMs) is one of the most exciting technological developments of our time. It has opened up new possibilities in the field of artificial intelligence, offering solutions to real-world problems across various industries. One of the most fascinating applications of these models is building custom question-answering applications or chatbots that draw from personal or organisational data sources. However, since LLMs are trained on general, publicly available data, their answers may not always be specific or useful to the end user. To solve this issue, we can use frameworks such as LangChain to develop custom chatbots that provide specific answers based on our own data. In this article, we will learn how to build a custom Q&A application and deploy it on the Streamlit Cloud. So let’s get started!

(Note: This article is derived from an article published on Analytics Vidhya: Link)

Learning objectives:

Learn why a custom question-answering application can be a better choice than fine-tuning a language model
Learn to develop a semantic search pipeline with OpenAI and Pinecone
Develop a custom Q&A application and deploy it on the Streamlit Cloud

Overview of question-answering application

Question-answering, or “chat over your data”, is a popular use case of LLMs and LangChain. LangChain provides a series of components to load the data sources you need for your use case. It supports a large number of data sources, along with transformers that convert them into a series of strings to store in vector databases. Once the data is stored in a database, one can query it using components called retrievers. Moreover, by passing the retrieved text to an LLM, we can get chatbot-style answers grounded in our data without digging through tons of documents ourselves.

LangChain supports a wide range of data sources: it offers over 120 integrations for connecting the data sources you may have.

Question-answering application workflow

We learned about the data sources supported by LangChain, which allow us to develop a question-answering pipeline from the components LangChain provides. Below are the components used for document loading, storage, retrieval, and generating output with an LLM.

Document loaders: To load user documents for vectorization and storage
Text splitters: Document transformers that split documents into chunks of a fixed maximum length so they can be stored and retrieved efficiently
Vector storage: Vector database integrations to store vector embeddings of the input texts
Document retrieval: To retrieve texts from the database based on user queries, using similarity search techniques
Model output: The final answer to the user query, generated by the model from a prompt that combines the query and the retrieved texts

This is the high-level workflow of the question-answering pipeline, which can solve many types of real-world problems. I haven’t gone deep into each LangChain component here, but if you would like to learn more, check out my previous article published on Analytics Vidhya (Link: Click Here)

Advantages of custom Q&A over model fine-tuning

Context-specific answers
Adaptable to new input documents
No need to fine-tune the model, which saves the cost of model training
More accurate and specific answers rather than general answers

What is a Pinecone vector database?

Pinecone is a popular vector database used in building LLM-powered applications. It is versatile and scalable for high-performance AI applications. It’s a fully managed, cloud-native vector database, so users do not have to deal with infrastructure hassles.

LLM-based applications involve large amounts of unstructured data, which require a sophisticated long-term memory to retrieve information accurately. Generative AI applications rely on semantic search over vector embeddings to return suitable context based on user input.

Pinecone is well suited for such applications and is optimized to store and query a large number of vectors with low latency, which helps in building user-friendly applications. Let’s learn how to set up a Pinecone vector database for our question-answering application.

# install pinecone-client (this snippet uses the 2.x client API, i.e. pinecone.init)
pip install pinecone-client

# import pinecone and initialize with your API key and environment name
import pinecone
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")

# create your first index to get started with storing vectors
# (index names may contain only lowercase letters, digits, and hyphens)
pinecone.create_index("first-index", dimension=8, metric="cosine")

# connect to the index before writing to it
index = pinecone.Index("first-index")

# Upsert sample data (5 8-dimensional vectors)
index.upsert([
    ("A", [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]),
    ("B", [0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2]),
    ("C", [0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]),
    ("D", [0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4, 0.4]),
    ("E", [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5])
])

# Use list_indexes() to list the indexes available in the database
pinecone.list_indexes()

[Output]>>> ['first-index']

In the above demonstration, we install the Pinecone client and initialize a connection to the vector database in our project environment. Once the connection is initialized, we can create an index with the required dimension and metric and insert vector embeddings into it. In the next section, we will develop a semantic search pipeline using Pinecone and LangChain for our application.
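To make the metric="cosine" setting more concrete, here is a small standard-library sketch of what cosine similarity computes. It is only an illustration, not Pinecone's implementation, and the helper name cosine_similarity and the extra vector vec_other are my own assumptions. One thing it shows is that the five sample vectors above all point in the same direction, so under the cosine metric they look identical to one another.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)

vec_a = [0.1] * 8  # same values as sample vector "A"
vec_e = [0.5] * 8  # same values as sample vector "E"
# Hypothetical extra vector that points in a different direction
vec_other = [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

sim_parallel = cosine_similarity(vec_a, vec_e)
sim_other = cosine_similarity(vec_a, vec_other)
print(round(sim_parallel, 6))  # parallel vectors score (approximately) 1.0
print(round(sim_other, 6))     # a different direction scores lower
```

For real experiments it is therefore worth upserting vectors that differ in direction, not only in magnitude, when the index uses the cosine metric.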

Building a semantic search pipeline using OpenAI and Pinecone

We learned that there are 5 steps in the question-answering application workflow. In this section, we will perform the first 4: document loading, text splitting, vector storage, and document retrieval.

To perform these steps in your local environment or in a cloud-based notebook environment like Google Colab, you need to install some libraries and create accounts on OpenAI and Pinecone to obtain their respective API keys. Let’s start with the environment setup:

Installing required libraries

# install langchain and openai with other dependencies
!pip install --upgrade langchain openai -q
!pip install pillow==6.2.2
!pip install unstructured -q
!pip install unstructured[local-inference] -q
!pip install detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.6#egg=detectron2 -q
!apt-get install poppler-utils
!pip install pinecone-client -q
!pip install tiktoken -q


# setup openai environment
import os
os.environ["OPENAI_API_KEY"] = "YOUR-API-KEY"

# importing libraries
import openai
import pinecone
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Pinecone
from langchain.llms import OpenAI
from langchain.chains.question_answering import load_qa_chain

Once the installation setup is done, import all the libraries as mentioned in the above code snippet. Then, follow the next steps below:

Load the documents

In this step, we will load the documents from a directory as the starting point for our AI project pipeline. We have 2 documents in our directory, which we will load into our project environment.

#load the documents from content/data dir
directory = '/content/data'

# load_docs function: load every document in the directory with LangChain's DirectoryLoader
def load_docs(directory):
  loader = DirectoryLoader(directory)
  documents = loader.load()
  return documents

documents = load_docs(directory)
len(documents)
[Output]>>> 5
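For intuition about what a directory loader returns, the sketch below reads plain-text files with the standard library and keeps each file's path as metadata, mirroring the page_content and metadata fields of the documents shown later in this article. It is a simplified stand-in and not how DirectoryLoader is implemented (the real loader handles many file types through the unstructured package); the function name load_text_files and the .txt-only filter are assumptions for illustration.

```python
from pathlib import Path
import tempfile

def load_text_files(directory):
    """Return one dict per .txt file, holding its text and source path."""
    documents = []
    for path in sorted(Path(directory).glob("*.txt")):
        documents.append({
            "page_content": path.read_text(encoding="utf-8"),
            "metadata": {"source": str(path)},
        })
    return documents

# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "pets.txt").write_text("Dogs and cats are common pets.", encoding="utf-8")
    Path(tmp, "fish.txt").write_text("Fish can be wonderful pets.", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")  # skipped by the filter
    demo_docs = load_text_files(tmp)

print(len(demo_docs))
```

Keeping the source path alongside the text is what later lets a retrieved chunk be traced back to the file it came from.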

Split the text data

Text embeddings and LLMs work better when each piece of input text is short and of a bounded length. Thus, splitting documents into chunks of similar size is an important step in most LLM use cases. We will use ‘RecursiveCharacterTextSplitter’ to split the documents into chunks of at most chunk_size characters.

# split the docs using recursive text splitter
def split_docs(documents, chunk_size=200, chunk_overlap=20):
  text_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size, chunk_overlap=chunk_overlap)
  docs = text_splitter.split_documents(documents)
  return docs

# split the docs
docs = split_docs(documents)
print(len(docs))
[Output]>>>12
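The effect of chunk_size and chunk_overlap can be illustrated with a much simpler splitter than the one used above. This sketch cuts a string into fixed-width character windows, where each window repeats the last chunk_overlap characters of the previous one. It is not the algorithm of RecursiveCharacterTextSplitter, which prefers to break on separators such as paragraph breaks, newlines, and spaces; split_with_overlap is a hypothetical helper for illustration only.

```python
def split_with_overlap(text, chunk_size=200, chunk_overlap=20):
    """Cut text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"  # 26 characters
chunks = split_with_overlap(sample, chunk_size=10, chunk_overlap=2)
print(chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

The overlap means a sentence that straddles a chunk boundary still appears, at least partly, in both neighbouring chunks, which is why a small overlap tends to help retrieval.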

Store the data in vector storage

Once the documents are split, we will store their embeddings in the vector database using OpenAI embeddings.

# create the OpenAI embeddings model (reads OPENAI_API_KEY from the environment)
embeddings = OpenAIEmbeddings()

# initialize the Pinecone client
pinecone.init(
    api_key="YOUR-API-KEY",
    environment="YOUR-ENV"
)

# define index name
# (the index must already exist, with a dimension matching the embeddings)
index_name = "langchain-project"

# embed the chunks and store them, with their text, in the Pinecone index
index = Pinecone.from_documents(docs, embeddings, index_name=index_name)
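A common failure at this step is a dimension mismatch: the index's dimension must equal the length of the embedding vectors, which is 1536 for OpenAI's text-embedding-ada-002 model. The check below is a plain-Python sketch of that validation, meant to run before any upsert. The function validate_vectors and the toy records are assumptions for illustration and are not part of the Pinecone or LangChain APIs.

```python
def validate_vectors(records, dimension):
    """Return the ids of records whose vector length differs from dimension.

    records is an iterable of (id, vector) pairs, the same shape as the
    tuples passed to upsert earlier in this article.
    """
    bad_ids = []
    for record_id, vector in records:
        if len(vector) != dimension:
            bad_ids.append(record_id)
    return bad_ids

EXPECTED_DIMENSION = 1536  # embedding size of text-embedding-ada-002

# Toy records with placeholder values (hypothetical ids)
records = [
    ("chunk-0", [0.0] * 1536),
    ("chunk-1", [0.0] * 8),  # deliberately wrong, e.g. a leftover test vector
    ("chunk-2", [0.0] * 1536),
]
mismatched = validate_vectors(records, EXPECTED_DIMENSION)
print(mismatched)  # ['chunk-1']
```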

Retrieve data from the vector database

At this stage, we will retrieve documents from our vector database using semantic search. Our vectors are stored in an index called “langchain-project”, and when we query it as shown below, we get the most similar documents back from the database.

# An example query to our database
query = "What are the different types of pet animals are there?"

# do a similarity search and store the documents in result variable 
result = index.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
result
[Output]>>>
[Document(page_content='Small mammals like hamsters, guinea pigs, and rabbits are often chosen for their low maintenance needs. Birds offer beauty and song, and reptiles like turtles and lizards can make intriguing pets.',
          metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='Pet animals come in all shapes and sizes, each suited to different lifestyles and home environments. Dogs and cats are the most common, known for their companionship and unique personalities. Small',
          metadata={'source': '/content/data/Different Types of Pet Animals.txt'}),
 Document(page_content='intriguing pets. Even fish, with their calming presence, can be wonderful pets.',
          metadata={'source': '/content/data/Different Types of Pet Animals.txt'})]

We can retrieve documents from the vector store based on similarity search, as shown in the above code snippet. If you are looking to learn more about semantic search applications, I highly recommend reading my previous article on this topic (link: click here)
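Conceptually, a similarity search embeds the query, scores it against the stored vectors, and returns the k best matches. The sketch below reproduces that flow with a toy bag-of-words "embedding" so that it runs without any API key. Real embeddings capture meaning rather than shared words, and a vector database typically uses an approximate nearest-neighbour index rather than a full scan, so treat embed, cosine, top_k, and the sample texts as illustrative assumptions only.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: lowercase word counts (not a real semantic embedding)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query, texts, k=3):
    """Return the k texts most similar to the query, best first."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(t)), t) for t in texts]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]

corpus = [
    "Dogs and cats are the most common pet animals.",
    "Pinecone stores vector embeddings.",
    "Birds and reptiles can also be kept as pet animals.",
]
matches = top_k("different types of pet animals", corpus, k=2)
print(matches)
```

Here the two pet-related sentences are returned and the unrelated one is dropped, which is the same behaviour the k=3 query above shows on the real index.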

Custom question-answering application with Streamlit

In the final stage of the question-answering application, we will integrate every component of the workflow to build a custom Q&A application that lets users chat with various data sources such as web-based articles, PDFs, and CSVs, making them more productive in their daily activities. We need to create a GitHub repository and add the following files to it.

Project files to be added:

main.py: A Python file containing the Streamlit front-end code
qanda.py: Prompt design and model output functions that return an answer to the user’s query
utils.py: Utility functions to load and split input documents
vector_search.py: Text embedding and vector storage functions
requirements.txt: Project dependencies needed to run the application on the Streamlit public cloud

We are supporting two types of data sources in this project demonstration:

Web URL based text data
Online PDF files

These two types cover a wide range of text data and are the most frequent in many use cases. You can see the main.py code below to understand the user interface of the app.

# import necessary libraries
import streamlit as st
import openai
import qanda
from vector_search import *
from utils import *
from io import StringIO

# take openai api key in
api_key = st.sidebar.text_input("Enter your OpenAI API key:", type='password')
# open ai key
openai.api_key = str(api_key)

# header of the app
_ , col2,_ = st.columns([1,7,1])
with col2:
    col2 = st.header("Simplchat: Chat with your data")
    url = False
    query = False
    pdf = False
    data = False
    # select option based on user need
    options = st.selectbox("Select the type of data source",
                            options=['Web URL','PDF','Existing data source'])
    #ask a query based on options of data sources
    if options == 'Web URL':
        url = st.text_input("Enter the URL of the data source")
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'PDF':
        pdf = st.text_input("Enter your PDF link here") 
        query = st.text_input("Enter your query")
        button = st.button("Submit")
    elif options == 'Existing data source':
        data= True
        query = st.text_input("Enter your query")
        button = st.button("Submit") 

# write code to get the output based on given query and data sources   
if button and url:
    with st.spinner("Updating the database..."):
        corpusData = scrape_text(url)
        encodeaddData(corpusData,url=url,pdf=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)

# write a code to get output on given query and data sources
if button and pdf:
    with st.spinner("Updating the database..."):
        corpusData = pdf_text(pdf=pdf)
        encodeaddData(corpusData,pdf=pdf,url=False)
        st.success("Database Updated")
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
if button and data:
    with st.spinner("Finding an answer..."):
        title, res = find_k_best_match(query,2)
        context = "\n\n".join(res)
        st.expander("Context").write(context)
        prompt = qanda.prompt(context,query)
        answer = qanda.get_answer(prompt)
        st.success("Answer: "+ answer)
        
        
# delete the vectors from the database
# (`index` is assumed to be provided by the star import from vector_search)
with st.expander("Delete the indexes from the database"):
    button1 = st.button("Delete the current vectors")
    if button1:
        index.delete(deleteAll='true')

To check other code files, please visit the GitHub repository of the project. (Link: Click Here)
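The qanda.py module is not reproduced in this article, so the following is only a hypothetical sketch of what a prompt-building function like qanda.prompt(context, query) might look like. It places the retrieved context ahead of the question and tells the model to stay within that context. The wording of the template and the name build_prompt are assumptions, not the repository's actual code.

```python
# Hypothetical template; the real project may word its prompt differently.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, say you don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {query}\n"
    "Answer:"
)

def build_prompt(context, query):
    """Combine retrieved context and the user's question into one prompt."""
    return PROMPT_TEMPLATE.format(context=context.strip(), query=query.strip())

# Same joining step as in main.py: retrieved chunks separated by blank lines.
retrieved = [
    "Dogs and cats are the most common pets.",
    "Fish can be wonderful pets.",
]
context = "\n\n".join(retrieved)
prompt = build_prompt(context, "Which pets are most common?")
print(prompt)
```

Instructing the model to answer only from the supplied context is what keeps the answers tied to your own data rather than to the model's general knowledge.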

Deployment of the Q&A app on Streamlit Cloud

Streamlit provides a community cloud to host applications free of charge. Moreover, Streamlit is easy to use thanks to its automated CI/CD pipeline features. To learn more about building apps with Streamlit, please see my previous article on Analytics Vidhya (Link: Click Here)

Conclusion

In conclusion, we have explored the exciting possibilities of building a custom question-answering application using LangChain and the Pinecone vector database. This blog has taken us through the fundamental concepts, starting from an overview of the question-answering application to understanding the capabilities of the Pinecone vector database. By combining a semantic search pipeline built on OpenAI embeddings with Pinecone’s efficient indexing and retrieval system, we can create a robust and accurate question-answering solution with Streamlit.

FAQs

Q1: What are Pinecone and LangChain?

A: Pinecone is a scalable vector database that serves as long-term memory for LLM-powered applications by storing text embeddings, while LangChain is a framework that allows developers to build LLM-powered applications.

Q2: What is the application of NLP question answering?

A: Question-answering applications are used in customer support chatbots, academic research, e-learning, etc.

Q3: Why should I use LangChain?

A: Working with LLMs can be complicated. LangChain lets developers use various components to integrate these LLMs in a developer-friendly way, which helps ship products faster.

Q4: What are the steps to build a Q&A application?

A: The steps to build a Q&A application are as follows: document loading, text splitting, vector storage, retrieval, and model output.

Q5: What are LangChain tools?

A: LangChain has the following tools: Document loaders, Document transformers, Vector stores, Chains, Memory, and Agents.